JP5541124B2

JP5541124B2 - Language processing device, speech synthesis device, language processing method, and language processing program

Info

Publication number: JP5541124B2
Application number: JP2010267285A
Authority: JP
Inventors: 英樹小島
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2010-11-30
Filing date: 2010-11-30
Publication date: 2014-07-09
Anticipated expiration: 2030-11-30
Also published as: JP2012118720A

Description

The present invention relates to a language processing device, a speech synthesis device, a language processing method, and a language processing program.

Natural language processing is built into various man-machine interfaces. In one aspect, natural language processing is used when a computer is made to output text as speech, that is, for so-called text-to-speech reading. To read text aloud, morphological analysis is performed on each sentence in the text to divide the sentence into morphemes, and a reading is then assigned to each morpheme. Executing morphological analysis in this way generates a phonetic character string from the character string input as text.

Here, as an example, consider the case where a phonetic character string is generated from a sentence written in mixed kanji and kana. Such a sentence may contain words that have more than one reading. As a technique for choosing the appropriate reading of such words, a language processing device that disambiguates readings using co-occurrence relationships has been disclosed. For example, when disambiguating the word “米”, the device assigns the reading “こめ” (kome, “rice”) when the word appears together with agriculture-related words, and the reading “べい” (bei, “America”) when it appears together with words related to international affairs. Besides such a language processing device, a personal-name kanji processing system has also been disclosed that uses a Chinese kanji table and a Korean kanji table to generate a kana-kanji conversion dictionary or to build a personal-name search system.

JP 2001-184345 A

However, the conventional techniques described above have the problem that, when text contains foreign-language words written in kanji, those words cannot be read aloud accurately.

For example, the language processing device described above merely selects the appropriate Japanese reading using co-occurrence relationships. Consequently, even if Japanese text contains Chinese, Korean, or Taiwanese words written in kanji, the device cannot determine which character strings are foreign. When text is read aloud using this device, an incorrect Japanese reading may therefore be assigned to a foreign-language character string.

Furthermore, even if the kana-kanji conversion dictionary built by the personal-name kanji processing system described above is used for morphological analysis, it is still impossible to determine which character strings in the text are foreign. In addition, which foreign words will appear in a text is not known in advance, and the number of words that exist in each country's language is effectively unlimited. Registering every necessary foreign word in the kana-kanji conversion dictionary is therefore difficult in practice.

The disclosed technology has been made in view of the above, and its object is to provide a language processing device, a speech synthesis device, a language processing method, and a language processing program that can accurately read aloud foreign-language words written in kanji that are included in text.

The language processing device disclosed in the present application has an analysis unit that performs morphological analysis, using a predetermined dictionary, on a sentence containing kanji, thereby dividing the sentence into morphemes and assigning a reading to each morpheme. The device further has an extraction unit that extracts unknown words written in kanji from the result of the morphological analysis by the analysis unit. The device further has a calculation unit that calculates a foreign-language score representing the likelihood that an unknown kanji word extracted by the extraction unit belongs to a language other than the language covered by the dictionary. The device further has a determination unit that determines, based on the foreign-language score calculated by the calculation unit, which country's language the unknown kanji word belongs to. The device further has a reading generation unit that generates a reading for the unknown kanji word according to the determination result of the determination unit.

According to one aspect of the language processing device disclosed in the present application, foreign-language words written in kanji that are included in text can be read aloud accurately.

FIG. 1 is a diagram illustrating the configuration of the speech synthesis device according to the first embodiment.
FIG. 2 is a diagram illustrating a configuration example of the information stored in the morpheme dictionary storage unit.
FIG. 3A is a diagram illustrating a configuration example of the Chinese co-occurrence dictionary stored in the co-occurrence dictionary storage unit.
FIG. 3B is a diagram illustrating a configuration example of the Korean co-occurrence dictionary stored in the co-occurrence dictionary storage unit.
FIG. 4 is a diagram illustrating a configuration example of the information stored in the Japanese dictionary storage unit.
FIG. 5 is a diagram illustrating a configuration example of the Korean kanji reading dictionary stored in the foreign language dictionary storage unit.
FIG. 6 is a diagram for explaining the method of creating a co-occurrence dictionary.
FIG. 7 is a diagram illustrating an example of accent assignment.
FIG. 8 is a flowchart illustrating the procedure of the reading generation process according to the first embodiment.
FIG. 9 is a flowchart illustrating the procedure of the Chinese score calculation process according to the first embodiment.
FIG. 10 is a block diagram illustrating the configuration of the speech synthesis device according to the second embodiment.
FIG. 11 is a diagram illustrating a configuration example of the information stored in the single-kanji dictionary storage unit.
FIG. 12 is a diagram for explaining the method of creating a single-kanji dictionary.
FIG. 13 is a flowchart illustrating the procedure of the Chinese score calculation process according to the second embodiment.
FIG. 14 is a diagram for explaining an example of a computer that executes the language processing program according to the third embodiment.

Embodiments of the language processing device, speech synthesis device, language processing method, and language processing program disclosed in the present application are described in detail below with reference to the drawings. These embodiments do not limit the disclosed technology. The embodiments may be combined as appropriate to the extent that their processing does not conflict.

[Configuration of the speech synthesis device]
FIG. 1 is a diagram illustrating the configuration of the speech synthesis device according to the first embodiment. The speech synthesis device 10 shown in FIG. 1 executes processing that outputs input text as speech, so-called text-to-speech reading, and in particular reads aloud accurately the foreign-language words written in kanji that are included in the text.

That is, the speech synthesis device 10 according to the present embodiment uses a foreign-language score to determine which language other than Japanese an unknown kanji word in the text belongs to, and then generates a reading for the unknown word. The speech synthesis device 10 can thus evaluate, as a foreign-language score, the likelihood that an unknown kanji word is foreign, and generate the reading in the language with the high score as the reading of the unknown word. As a result, even if Japanese text contains kanji-written words of other languages such as Chinese, Korean, or Taiwanese, the speech synthesis device 10 can avoid assigning an incorrect Japanese reading to the foreign character string. The speech synthesis device 10 according to the present embodiment can therefore accurately read aloud foreign-language words written in kanji that are included in text.

The example of FIG. 1 is described below on the assumption that the text-to-speech function is implemented on a personal computer (PC). The disclosed device is not limited to this, however, and can be applied to any information processing device. Examples of such information processing devices include mobile phones, PHS (Personal Handyphone System) terminals, PDAs (Personal Digital Assistants), and car navigation systems.

The speech synthesis device 10 shown in FIG. 1 has a reception unit 11, a morpheme dictionary storage unit 12, a co-occurrence dictionary storage unit 13, a Japanese dictionary storage unit 14a, a foreign language dictionary storage unit 14b, a corpus storage unit 15a, and a creation unit 15. The speech synthesis device 10 further has a language processing unit 16, a prosody generation unit 17, a synthesis unit 18, and an output unit 19. In addition to the functional units shown in FIG. 1, the speech synthesis device 10 is assumed to have the various functional units of a known computer. One example is input devices such as a keyboard and a mouse. Another example is display devices such as a monitor, a display, or a touch panel. A further example is an interface for communicating with external devices.

The reception unit 11 is a processing unit that receives text input. As one example, the reception unit 11 receives, as input text, the text data of a web page acquired by a web browser (not shown). As another example, the reception unit 11 receives, as input text, a text file designated via an input device (not shown). As a further example, the reception unit 11 converts data hooked from an application program (not shown) into text data and receives the result as input text.

The reception unit 11 also divides the received input text into the units in which the language processing unit 16, described later, performs its processing. As one example, each time the reception unit 11 detects a delimiter such as a full stop, a question mark, or an exclamation mark, it cuts the character string of the input text at that point, thereby outputting the input text to the downstream language processing unit 16 one sentence at a time. The following description assumes sentence-by-sentence output to the language processing unit 16, but the input text can be output in any unit, such as a number of characters or a data size.
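The sentence-by-sentence division performed by the reception unit 11 might be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `split_into_sentences` and the exact set of delimiter characters are assumptions.

```python
import re

# Delimiters assumed for illustration: the Japanese full stop plus
# full-width and ASCII question marks and exclamation marks.
DELIMITERS = "。？！?!"

# A sentence is a run of non-delimiter characters followed by at most one delimiter.
_SENTENCE = re.compile(f"[^{DELIMITERS}]+[{DELIMITERS}]?")


def split_into_sentences(text: str) -> list[str]:
    """Cut the input text after each delimiter and return the sentences in order."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


sentences = split_into_sentences("こんにちは。中国の北京へ行きます。")
```

Each element of `sentences` would then be handed to the language processing unit 16 in turn.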

The morpheme dictionary storage unit 12 is a storage unit that stores the dictionary used for morphological analysis by the analysis unit 16a, described later. In one aspect, the morpheme dictionary storage unit 12 stores morphemes and readings in association with each other. A “morpheme” here is the smallest unit of a sentence that carries meaning. The “reading” stored in the morpheme dictionary storage unit 12 is adopted for morphemes that the extraction unit 16b, described later, did not extract as unknown words, and for unknown kanji words that the determination unit 16d, described later, determines to be Japanese.

FIG. 2 is a diagram illustrating a configuration example of the information stored in the morpheme dictionary storage unit. In the example of FIG. 2, the morpheme “中国” is read “チューゴク”, the morpheme “の” is read “ノ”, the morpheme “を” is read “オ”, and the morpheme “訪問” is read “ホーモン”. Further, the morpheme “する” is read “スル”, the morpheme “成” is read “セイ”, the morpheme “都” is read “ミヤコ”, the morpheme “釜” is read “カマ”, and the morpheme “山” is read “ヤマ”. Of these, “成”, “都”, “釜”, and “山” are extracted as unknown words by the extraction unit 16b, described later, when the analysis unit 16a, described later, analyzes them as single kanji that do not form a compound with other kanji. The morpheme dictionary shown in FIG. 2 is merely an example; any number of readings can be registered for any part of speech.

The co-occurrence dictionary storage unit 13 is a storage unit that stores co-occurrence dictionaries defining the co-occurrence relationships between foreign-language words and Japanese words. These dictionaries are used to calculate the foreign-language score: the calculation unit 16c, described later, uses the log likelihoods of the co-occurrence words that correspond to words in the sentence from which the extraction unit 16b, described later, extracted an unknown word.

In one aspect, the co-occurrence dictionary storage unit 13 stores a surface form and a log likelihood in association with each other for each type of co-occurrence word. The “type of co-occurrence word” is a classification by the position that a word occupied relative to a Chinese noun, in a sentence containing that Chinese noun, when the word was registered in the co-occurrence dictionary at creation time. For example, a word that appeared anywhere other than immediately after the Chinese noun is classified as a “general co-occurrence word”, and a word that appeared immediately after the Chinese noun is classified as an “immediately following word”. The “log likelihood” represents how likely it is that an unknown word appearing in a sentence together with the co-occurrence word is foreign.

Here, a Chinese co-occurrence dictionary is shown as an example of a co-occurrence dictionary. FIG. 3A is a diagram illustrating a configuration example of the Chinese co-occurrence dictionary stored in the co-occurrence dictionary storage unit. In the example of FIG. 3A, “中国” (China), “主席” (president), and “来日” (visit to Japan) are registered as general co-occurrence words, with log likelihoods of “2.322”, “1.874”, and “1.246”, respectively. Further, “中主席” (Chinese president) is registered as an immediately following word, with a log likelihood of “4.822”. As described later in detail with reference to FIG. 6, an immediately following word is more strongly related to the Chinese noun than a general co-occurrence word is, so when the co-occurrence dictionary is created, the log likelihood of an immediately following word is given a larger weight than that of a general co-occurrence word before it is registered.

A Korean co-occurrence dictionary is shown as a further example. FIG. 3B is a diagram illustrating a configuration example of the Korean co-occurrence dictionary stored in the co-occurrence dictionary storage unit. In the example of FIG. 3B, “韓国” (South Korea), “北朝鮮” (North Korea), and “来日” (visit to Japan) are registered as general co-occurrence words, with log likelihoods of “2.401”, “2.521”, and “1.312”, respectively. Further, “韓国” is registered as an immediately following word, with a log likelihood of “4.972”. Although co-occurrence dictionaries for Chinese and Korean are shown here, a co-occurrence dictionary can be prepared for any other language that uses kanji, such as Taiwanese.
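To make the role of the two dictionaries concrete, the following sketch holds the entries of FIG. 3A and FIG. 3B in Python dictionaries and looks up the words surrounding an unknown word. This passage does not specify how the calculation unit 16c combines the log likelihoods it looks up. Summing them, treating unregistered words as contributing nothing, and choosing the language with the highest score are assumptions made for illustration, as are the function name, the data layout, and the example sentence.

```python
# Entries copied from FIG. 3A and FIG. 3B; the nested-dict layout is an assumption.
CO_OCCURRENCE = {
    "Chinese": {
        "general": {"中国": 2.322, "主席": 1.874, "来日": 1.246},
        "following": {"中主席": 4.822},
    },
    "Korean": {
        "general": {"韓国": 2.401, "北朝鮮": 2.521, "来日": 1.312},
        "following": {"韓国": 4.972},
    },
}


def foreign_language_score(morphemes, unknown_index, language):
    """Sum the log likelihoods of dictionary words around the unknown word (assumed rule)."""
    entries = CO_OCCURRENCE[language]
    score = 0.0
    for i, word in enumerate(morphemes):
        if i == unknown_index:
            continue
        # The word right after the unknown word is looked up as an "immediately following word".
        kind = "following" if i == unknown_index + 1 else "general"
        score += entries[kind].get(word, 0.0)
    return score


# Illustrative sentence "中国の成都を訪問する", with "成都" (index 2) as the unknown word.
morphemes = ["中国", "の", "成都", "を", "訪問", "する"]
scores = {lang: foreign_language_score(morphemes, 2, lang) for lang in CO_OCCURRENCE}
best = max(scores, key=scores.get)
```

In this example only “中国” is found, and only in the Chinese dictionary, so the Chinese score is the larger of the two.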

The Japanese dictionary storage unit 14a is a storage unit that stores a Japanese on-reading (音読み) dictionary. The reading generation unit 16e, described later, uses this dictionary to assign Japanese on-readings to an unknown kanji word when the determination unit 16d, described later, determines that the word is Chinese. Whether the reading generation unit 16e generates Japanese on-readings or Chinese readings for the unknown word can be selected by setting the “Chinese reading mode”, in which kanji are read in Chinese, to ON or OFF via an input device (not shown).

In one aspect, the Japanese dictionary storage unit 14a stores kanji and their on-readings in association with each other. FIG. 4 is a diagram illustrating a configuration example of the information stored in the Japanese dictionary storage unit. In the example of FIG. 4, the on-reading of the kanji “成” is “セー”, the on-reading of “都” is “ト”, the on-reading of “李” is “リ”, and the on-reading of “劉” is “リュウ”. The Japanese on-reading dictionary shown in FIG. 4 is merely an example; readings for any number of kanji can be registered.

The foreign language dictionary storage unit 14b is a storage unit that stores dictionaries of foreign-language kanji readings. The reading generation unit 16e, described later, uses these foreign language dictionaries when the determination unit 16d, described later, determines that an unknown kanji word belongs to a language other than Japanese. When the unknown kanji word is determined to be Chinese, the reading generation unit 16e uses the foreign language dictionary storage unit 14b only if the “Chinese reading mode” is set to ON.

Here, a Korean kanji reading dictionary is shown as an example of a foreign language dictionary. FIG. 5 is a diagram illustrating a configuration example of the Korean kanji reading dictionary stored in the foreign language dictionary storage unit. In the example of FIG. 5, the Korean reading of the kanji “金” is “キム”, the Korean reading of “釜” is “プ”, the Korean reading of “山” is “サン”, and the Korean reading of “李” is “イ”. The Korean kanji reading dictionary shown in FIG. 5 is merely an example; readings for any number of kanji can be registered. Although an example in which kanji and readings are stored in association with each other has been described, the accent of each reading may also be stored in association with it. A Chinese kanji reading dictionary can be realized with the same configuration as the Korean one shown here.
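As an illustration of how a per-kanji dictionary of this kind could yield a reading for a whole word, the sketch below concatenates the readings of the individual kanji. The concatenation rule, the function name `generate_reading`, and the handling of unregistered kanji are assumptions; only the four dictionary entries come from FIG. 5.

```python
# Entries from FIG. 5 (Korean readings of single kanji).
KOREAN_READINGS = {"金": "キム", "釜": "プ", "山": "サン", "李": "イ"}


def generate_reading(word, readings):
    """Concatenate per-kanji readings; return None if any kanji is unregistered."""
    parts = []
    for kanji in word:
        if kanji not in readings:
            return None
        parts.append(readings[kanji])
    return "".join(parts)


print(generate_reading("釜山", KOREAN_READINGS))  # プサン
```

Returning `None` for an unregistered kanji leaves the fallback policy to the caller, since the text does not say what should happen in that case.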

In this embodiment, when an unknown kanji word is Chinese, the dictionary used to generate its reading is switched between the Japanese on-reading dictionary and the foreign language dictionary according to whether the “Chinese reading mode” is ON or OFF, but the technology is not limited to this example. For instance, either the Japanese on-reading dictionary or the foreign language dictionary may be used exclusively, in which case the unused dictionary need not be held.
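The dictionary selection described for the storage units 12, 14a, and 14b can be summarized in a few lines. This is a sketch under stated assumptions: the function `select_dictionary` and its string labels are hypothetical, and the determination result is assumed to arrive as a language name.

```python
def select_dictionary(language: str, chinese_reading_mode: bool) -> str:
    """Return a label for the dictionary used to read an unknown kanji word."""
    if language == "Japanese":
        # Readings from the morpheme dictionary (storage unit 12) are adopted.
        return "morpheme dictionary"
    if language == "Chinese" and not chinese_reading_mode:
        # Mode OFF: Chinese words receive Japanese on-readings (storage unit 14a).
        return "Japanese on-reading dictionary"
    # Other foreign languages, or Chinese with the mode ON (storage unit 14b).
    return "foreign language dictionary"
```

With the mode OFF, a word judged to be Chinese is read with Japanese on-readings, while a word judged to be Korean is read from the foreign language dictionary regardless of the mode.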

The corpus storage unit 15a is a storage unit that stores a Japanese corpus, that is, a large body of language material. The corpus is used when the creation unit 15, described later, creates a co-occurrence dictionary. The more text is prepared as the corpus, therefore, the more accurate the foreign-language score calculated by the calculation unit 16c, described later, can be made. In addition, sentences in which foreign-language words and Japanese words co-occur are the useful samples for creating a co-occurrence dictionary. It is therefore preferable that the corpus contain many foreign-language nouns, such as Chinese nouns and Korean nouns.

The creation unit 15 is a processing unit that creates co-occurrence dictionaries using the corpus storage unit 15a. The method of creating a co-occurrence dictionary is described with reference to FIG. 6. FIG. 6 is a diagram for explaining the method of creating a co-occurrence dictionary. The example of FIG. 6 illustrates the creation of a Chinese co-occurrence dictionary and a Korean co-occurrence dictionary, but the same method applies to creating a co-occurrence dictionary for any other language that uses kanji, such as Taiwanese.

As shown in FIG. 6, the creation unit 15 performs morphological analysis on the Japanese corpus read from the corpus storage unit 15a (step S51). As an example, suppose that the corpus storage unit 15a holds the corpus “こんにちは。…中国の北京へ行きます。…北朝鮮の平壌で１７日、…” (“Hello. … I am going to Beijing, China. … On the 17th in Pyongyang, North Korea, …”). In this case, the creation unit 15 obtains the following result from the morphological analysis of the sentences in the corpus: “こんにちは (interjection)。… 中国 (noun)・の (particle)・北京 (Chinese noun)・へ (particle)・行き (verb)・ます (auxiliary verb)。… 北朝鮮 (noun)・の (particle)・平壌 (Korean noun)・で (particle)・１７ (numeral)・日 (counter)、…”. The middle dot “・” in this result marks the boundary between one morpheme and the next.

Next, from the results of the preceding morphological analysis, the creation unit 15 extracts only the sentences that contain a Chinese noun and builds a database from them (step SA52). For example, the creation unit 15 extracts the sentence “中国 (noun)・の (particle)・北京 (Chinese noun)・へ (particle)・行き (verb)・ます (auxiliary verb)”, which contains the Chinese noun “北京”. To detect Chinese nouns during this extraction, the words may be matched against the nouns listed in a Chinese-language dictionary, or the designer of the co-occurrence dictionary may mark the Chinese nouns in the corpus in advance by manual instruction. Although a single sentence containing a Chinese noun is shown here, a large number of sentences are extracted in practice.

The creation unit 15 then generates, from the database of sentences containing Chinese nouns, the set of words that immediately follow a Chinese noun, that is, the set of immediately following words (step SA53-1). For example, the single sentence above contributes “へ (particle)” to this set, and other sentences contribute words such as “に (particle) …・中主席 (noun)・中首相 (noun) …”.

The creation unit 15 also generates, from the database of sentences containing Chinese nouns, the set of words other than the Chinese noun and the word immediately following it, that is, the set of general co-occurrence words (step SA53-2). For example, the set of general co-occurrence words contains words such as “… 中国 (noun)・の (particle)・行き (verb)・ます (auxiliary verb) …”.
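Steps SA53-1 and SA53-2 can be sketched for a single analyzed sentence as follows. The representation of a sentence as a list of (surface, part-of-speech) pairs, the English part-of-speech labels, and the function name are assumptions for illustration; the sketch also assumes that a Chinese noun directly following another Chinese noun is not counted as an immediately following word.

```python
def collect_co_occurrence_words(sentence):
    """Split a sentence's words into immediately following and general co-occurrence words.

    `sentence` is a list of (surface, part_of_speech) pairs.
    """
    noun_positions = {i for i, (_, pos) in enumerate(sentence) if pos == "Chinese noun"}
    following_positions = {i + 1 for i in noun_positions if i + 1 < len(sentence)}
    # Assumed rule: Chinese nouns themselves never enter either set.
    following_positions -= noun_positions
    following = [sentence[i] for i in sorted(following_positions)]
    general = [
        pair
        for i, pair in enumerate(sentence)
        if i not in noun_positions and i not in following_positions
    ]
    return following, general


sentence = [
    ("中国", "noun"), ("の", "particle"), ("北京", "Chinese noun"),
    ("へ", "particle"), ("行き", "verb"), ("ます", "auxiliary verb"),
]
following, general = collect_co_occurrence_words(sentence)
```

For the example sentence, `following` holds only “へ” and `general` holds “中国”, “の”, “行き”, and “ます”, matching the sets described in the text.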

Next, for each of the two sets generated above, the creation unit 15 calculates the frequency with which each word appears in the set to which it belongs, hereinafter called the “appearance frequency” (steps SA54-1 and SA54-2). In the example of FIG. 6, the appearance frequency of the immediately following word “へ (particle)” is calculated as “2012 (times)” and that of the immediately following word “に (particle)” as “1893 (times)”. Further, the appearance frequency of the immediately following word “中主席 (noun)” is calculated as “203 (times)” and that of the immediately following word “中首相 (noun)” as “159 (times)”. For the general co-occurrence words, the appearance frequency of “が (particle)” is calculated as “4183 (times)” and that of “は (particle)” as “4024 (times)”. Further, the appearance frequency of the general co-occurrence word “中国 (noun)” is calculated as “176 (times)”, that of “主席 (noun)” as “165 (times)”, and that of “来日 (noun)” as “162 (times)”. Although the number of times a word appears is used as the appearance frequency here, the proportion of the set that each word accounts for may be calculated as the appearance frequency instead.

ここで、直後の語および一般共起語の出現頻度の上位に現れる単語は、助詞や助動詞などの付属語であり、これらの付属語は中国語名詞との相関性は低い。よって、以降の処理では、直後の語の集合および一般共起語の集合のうち品詞が名詞である単語を共起辞書に載せる対象とする。これによって、中国語名詞と相関性が高い単語を優先して共起辞書に登録でき、それを用いて算出される他国語スコアの信頼性を高めることができる。 Here, the words appearing at the top of the appearance frequency of the immediately following word and general co-occurrence words are adjuncts such as particles and auxiliary verbs, and these adjuncts have low correlation with Chinese nouns. Therefore, in the subsequent processing, a word whose part of speech is a noun from a set of words immediately after and a set of general co-occurrence words is set as an object to be placed in the co-occurrence dictionary. As a result, it is possible to preferentially register words having high correlation with Chinese nouns in the co-occurrence dictionary, and it is possible to improve the reliability of the foreign language score calculated using the words.
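
The counting in steps SA54-1 and SA54-2 and the noun filter described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `noun_frequencies`, the `(surface, part_of_speech)` tuple layout, and the toy input are hypothetical and do not come from the patent; only the idea of counting occurrences per set and keeping nouns is taken from the text.

```python
from collections import Counter

def noun_frequencies(words):
    """Count each (surface, part_of_speech) pair in one set and keep
    only the entries whose part of speech is "noun"."""
    counts = Counter(words)
    return {surface: n for (surface, pos), n in counts.items() if pos == "noun"}

# Toy set of immediately following words (illustrative data, not FIG. 6).
following = [("へ", "particle")] * 3 + [("中主席", "noun")] * 2 + [("に", "particle")]
print(noun_frequencies(following))
```

The same function would be applied once to the set of immediately following words and once to the set of general co-occurrence words.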

そして、作成部１５は、直後の語および一般共起語の集合ごとに各単語の対数尤度を算出する（ステップＳＡ５５−１及びステップＳＡ５５−２）。かかる対数尤度には、一例として、算出式「対数尤度＝ｌｏｇ（出現頻度＊１００％／文章の数）」が用いられる。上記の算出式の「対数の底」には、任意のものを適用できるが、一例としては、自然対数を用いるのが好ましい。また、上記の算出式の「文章の数」とは、中国語名詞を含む文章のデータベースに格納されている文章の数を指す。図６の例で言えば、直後の語「中主席（名詞）」の対数尤度として「2.411」が算出され、直後の語「中首相（名詞）」の対数尤度として「1.156」が算出される。一方、一般共起語「中国（名詞）」の対数尤度として「2.322」が算出され、一般共起語「主席（名詞）」の対数尤度として「1.874」が算出され、一般共起語「来日（名詞）」の対数尤度として「1.246」が算出される。 Then, the creation unit 15 calculates the log likelihood of each word, separately for the set of immediately following words and the set of general co-occurrence words (steps SA55-1 and SA55-2). As one example, the formula “log likelihood = log(appearance frequency × 100% / number of sentences)” is used. Any base may be used for the logarithm, but the natural logarithm is preferable as one example. The “number of sentences” in the formula is the number of sentences stored in the database of sentences containing Chinese nouns. In the example of FIG. 6, “2.411” is calculated as the log likelihood of the immediately following word “中主席 (noun)”, and “1.156” as that of the immediately following word “中首相 (noun)”. Meanwhile, “2.322” is calculated as the log likelihood of the general co-occurrence word “中国 (China, noun)”, “1.874” as that of “主席 (noun)”, and “1.246” as that of “来日 (visit to Japan, noun)”.
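
The formula above can be written as a short function. This is a minimal sketch, assuming the natural logarithm that the text names as preferable; the function name `log_likelihood` and the sentence count used in the usage line are hypothetical, since this section does not state how many sentences the database holds, so the values of FIG. 6 are not reproduced here.

```python
import math

def log_likelihood(appearance_frequency, number_of_sentences):
    """log(appearance frequency * 100 / number of sentences), natural log."""
    if appearance_frequency <= 0 or number_of_sentences <= 0:
        raise ValueError("both arguments must be positive")
    return math.log(appearance_frequency * 100 / number_of_sentences)

# Hypothetical database size chosen only to show the call.
print(log_likelihood(203, 2000))
```

With a fixed number of sentences, a word that appears more often receives a larger log likelihood, which is the property the later scoring relies on.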

その後、作成部１５は、直後の語の対数尤度および一般共起語の対数尤度のうち直後の語の対数尤度に一般共起語の対数尤度よりも大きい重みを付与する重み付け処理を行った上で中国語用の共起辞書として共起辞書記憶部１３に登録する（ステップＳＡ５６）。かかる重み付けの一例としては、作成部１５は、直後の語の対数尤度の重みを一般共起語の対数尤度の２倍とし、直後の語の対数尤度に「２」を乗算する。かかる重み付けによって、中国語名詞との相関性が一般共起語よりも高い直後の語の対数尤度が他国語スコアの算出に反映される割合が高まる結果、他国語スコアの信頼性を高めることができる。なお、対数尤度の重み付けは、上記の例には限定されない。すなわち、直後の語の対数尤度に付与する重みが一般共起語の対数尤度に付与する重みよりも高ければよく、それぞれの対数尤度への重みには任意の値を付与できる。 Thereafter, the creation unit 15 performs a weighting process that gives the log likelihood of the immediately following words a larger weight than the log likelihood of the general co-occurrence words, and then registers the result in the co-occurrence dictionary storage unit 13 as the co-occurrence dictionary for Chinese (step SA56). As one example of such weighting, the creation unit 15 sets the weight of the log likelihood of an immediately following word to twice that of a general co-occurrence word, and multiplies the log likelihood of the immediately following word by “2”. With this weighting, the log likelihood of the immediately following words, which correlate with Chinese nouns more strongly than the general co-occurrence words do, is reflected more heavily in the calculation of the foreign language score, which improves the reliability of that score. The weighting is not limited to this example. It suffices that the weight given to the log likelihood of the immediately following words be higher than the weight given to that of the general co-occurrence words, and any values may be assigned as the respective weights.
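
The weighting of step SA56 might look like the sketch below. The function name, the two-key dictionary layout, and the parameter names are assumptions made for illustration; the actual layout of the co-occurrence dictionary in FIG. 3A is not described in this section. The log likelihood values in the usage lines are the ones quoted from FIG. 6 above.

```python
def build_cooccurrence_dictionary(following_ll, general_ll,
                                  following_weight=2.0, general_weight=1.0):
    """Apply a larger weight to immediately following words than to
    general co-occurrence words and return both weighted tables."""
    if following_weight <= general_weight:
        raise ValueError("following_weight must exceed general_weight")
    return {
        "following": {w: ll * following_weight for w, ll in following_ll.items()},
        "general": {w: ll * general_weight for w, ll in general_ll.items()},
    }

chinese_dictionary = build_cooccurrence_dictionary(
    {"中主席": 2.411, "中首相": 1.156},
    {"中国": 2.322, "主席": 1.874, "来日": 1.246},
)
print(chinese_dictionary["following"]["中主席"])
```

The guard reflects the only constraint the text places on the weights: the weight for immediately following words must be the higher of the two.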

このように、上記のステップＳ５１〜ステップＳＡ５６までの処理により、図３Ａに示した中国語用の共起辞書が作成される。これによって、漢字表記の未知語が中国語である確からしさを評価するための評価基準を定義することができる。なお、上記のステップＳＡ５３−１〜ＳＡ５５−１の処理と、上記のステップＳＡ５３−２〜ＳＡ５５−２の処理とは、両者を並列して実行することもできるし、いずれを先または後として処理を実行することとしてもかまわない。 As described above, the Chinese co-occurrence dictionary shown in FIG. 3A is created by the processing from step S51 to step SA56. This makes it possible to define an evaluation criterion for evaluating the probability that an unknown word in Chinese characters is Chinese. In addition, the process of said step SA53-1-SA55-1 and the process of said step SA53-2-SA55-2 can also be performed in parallel, and either is processed before or after. It does not matter as executing.

一方、作成部１５は、韓国語用の共起辞書についても中国語用の共起辞書と同様に作成する。これを説明すると、作成部１５は、先の形態素解析により得た結果のうち、韓国語名詞を含む文章のみを抽出してデータベースを作成する（ステップＳＢ５２）。例えば、作成部１５は、韓国語名詞「平壌」を含む文章「…北朝鮮（名詞）・の（助詞）・平壌（韓国語名詞）・で（助詞）・１７（数詞）・日（数助詞）、…」を抽出する。かかる文章の抽出にあたって韓国語名詞を検出するには、韓国語の国語辞書に載っている名詞と突き合わせすることにより検出することとしてもよいし、共起辞書の設計者による指示操作でコーパスに含まれる韓国語名詞に予めマーキングさせることとしてもよい。なお、ここでは、韓国語名詞を含む文章として１つの文章を例示したが、実際には多数の文章が抽出されるものとする。 Meanwhile, the creation unit 15 creates the co-occurrence dictionary for Korean in the same manner as the one for Chinese. Specifically, from the results of the earlier morphological analysis, the creation unit 15 extracts only the sentences containing Korean nouns and builds a database from them (step SB52). For example, the creation unit 15 extracts the sentence “... 北朝鮮 (North Korea, noun), の (particle), 平壌 (Pyongyang, Korean noun), で (particle), 17 (numeral), 日 (counter suffix), ...”, which contains the Korean noun “平壌”. To detect Korean nouns when extracting such sentences, the nouns may be matched against those listed in a Korean-language dictionary, or the designer of the co-occurrence dictionary may mark the Korean nouns in the corpus in advance by an instruction operation. Although a single sentence is shown here as an example, a large number of sentences are extracted in practice.

そして、作成部１５は、韓国語名詞を含む文章のデータベースから、韓国語名詞の直後の語の集合、すなわち直後の語の集合を生成する（ステップＳＢ５３−１）。例えば、直後の語の集合としては、上記の１文の例では「で（助詞）」が生成される他、「へ（助詞）…・韓国（名詞）・総書記（名詞）…」などが生成されるものとする。 Then, the creation unit 15 generates a set of words immediately after the Korean noun, that is, a set of words immediately after the Korean noun, from a database of sentences including Korean nouns (step SB53-1). For example, as a set of words immediately after, in the example of the above sentence, “de (particle)” is generated, and “he (particle)… ・ Korea (noun), general secretary (noun)…”, etc. Shall be generated.

また、作成部１５は、韓国語名詞を含む文章のデータベースから、韓国語名詞およびその直後の語を除く単語の集合、すなわち一般共起語の集合を生成する（ステップＳＢ５３−２）。例えば、一般共起語の集合としては、「…北朝鮮（名詞）・の（助詞）・１７（数詞）・日（数助詞）、…」などが生成される。 Further, from the database of sentences containing Korean nouns, the creation unit 15 generates the set of words other than the Korean nouns and the words immediately following them, that is, the set of general co-occurrence words (step SB53-2). For example, “... 北朝鮮 (North Korea, noun), の (particle), 17 (numeral), 日 (counter suffix), ...” and the like are generated as the set of general co-occurrence words.
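
Steps SB53-1 and SB53-2 split each sentence into the word immediately after the Korean noun and the remaining words. A minimal sketch of that split, applied to the example sentence above, is shown here. The function name, the part-of-speech labels such as `"korean_noun"`, and the tuple layout are hypothetical choices for illustration.

```python
def split_cooccurrences(sentence, target_pos):
    """Return (immediately following words, general co-occurrence words)
    for one sentence given as a list of (surface, part_of_speech)."""
    targets = {i for i, (_, pos) in enumerate(sentence) if pos == target_pos}
    # Positions right after a target noun, excluding other target nouns.
    following_idx = {i + 1 for i in targets if i + 1 < len(sentence)} - targets
    following = [sentence[i] for i in sorted(following_idx)]
    general = [m for i, m in enumerate(sentence)
               if i not in targets and i not in following_idx]
    return following, general

sentence = [("北朝鮮", "noun"), ("の", "particle"), ("平壌", "korean_noun"),
            ("で", "particle"), ("１７", "numeral"), ("日", "counter")]
following_words, general_words = split_cooccurrences(sentence, "korean_noun")
print(following_words)
print(general_words)
```

The same function would serve the Chinese case by passing a label for Chinese nouns instead.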

続いて、作成部１５は、先に生成した２つの集合ごとに各単語の出現頻度を算出する（ステップＳＢ５４−１及びステップＳＢ５４−２）。図６の例で言えば、直後の語「で（助詞）」の出現頻度として「1671（回）」が算出され、直後の語「へ（助詞）」の出現頻度として「1422（回）」が算出される。さらに、直後の語「韓国（名詞）」の出現頻度として「160（回）」が算出され、直後の語「総書記（名詞）」の出現頻度として「133（回）」が算出される。一方、一般共起語「の（助詞）」の出現頻度として「2977（回）」が算出され、また、一般共起語「は（助詞）」の出現頻度として「2889（回）」が算出される。さらに、一般共起語「韓国（名詞）」の出現頻度として「156（回）」が算出され、一般共起語「北朝鮮（名詞）」の出現頻度として「161（回）」が算出され、一般共起語「来日（名詞）」の出現頻度として「128（回）」が算出される。なお、ここでも、以降の処理では、直後の語の集合および一般共起語の集合のうち品詞が名詞である単語を共起辞書に載せる対象とされる。 Subsequently, the creation unit 15 calculates the appearance frequency of each word for each of the two sets generated previously (Steps SB54-1 and SB54-2). In the example of FIG. 6, “1671 (times)” is calculated as the appearance frequency of the word “de (particle)” immediately after, and “1422 (times)” as the frequency of occurrence of the word “he (particle)” immediately after. Is calculated. Further, “160 (times)” is calculated as the appearance frequency of the word “Korea (noun)” immediately after, and “133 (times)” is calculated as the appearance frequency of the word “general secretary (noun)” immediately after. On the other hand, “2977 (times)” is calculated as the appearance frequency of the general co-occurrence word “no (particle)”, and “2889 (times)” is calculated as the appearance frequency of the general co-occurrence word “ha (particle)”. Is done. Furthermore, “156 (times)” is calculated as the appearance frequency of the general co-occurrence word “Korea (noun)”, and “161 (times)” is calculated as the appearance frequency of the general co-occurrence word “North Korea (noun)”. Then, “128 (times)” is calculated as the appearance frequency of the general co-occurrence word “visit to Japan (noun)”. In this case as well, in the subsequent processing, a word whose part of speech is a noun in a set of words immediately after and a set of general co-occurrence words is a target to be placed in the co-occurrence dictionary.

そして、作成部１５は、直後の語および一般共起語の集合ごとに各単語の対数尤度を算出する（ステップＳＢ５５−１及びステップＳＢ５５−２）。図６の例で言えば、直後の語「韓国（名詞）」の対数尤度として「2.486」が算出され、直後の語「総書記（名詞）」の対数尤度として「1.475」が算出される。一方、一般共起語「韓国（名詞）」の対数尤度として「2.401」が算出され、一般共起語「北朝鮮（名詞）」の対数尤度として「2.521」が算出され、一般共起語「来日（名詞）」の対数尤度として「1.312」が算出される。 Then, the creation unit 15 calculates the log likelihood of each word for each set of immediately following words and general co-occurrence words (step SB55-1 and step SB55-2). In the example of FIG. 6, “2.486” is calculated as the log likelihood of the immediately following word “Korea (noun)”, and “1.475” is calculated as the log likelihood of the immediately following word “general secretary (noun)”. The On the other hand, “2.401” is calculated as the log likelihood of the general co-occurrence word “Korea (noun)” and “2.521” is calculated as the log likelihood of the general co-occurrence word “North Korea (noun)”. “1.312” is calculated as the log likelihood of the word “visit to Japan (noun)”.

その後、作成部１５は、直後の語の対数尤度および一般共起語の対数尤度のうち直後の語の対数尤度に一般共起語の対数尤度よりも大きい重みを付与する重み付け処理を行った上で韓国語用の共起辞書として共起辞書記憶部１３に登録する（ステップＳＢ５６）。かかる重み付けの一例としては、作成部１５は、上記の中国語用の共起辞書の場合と同様に、直後の語の対数尤度の重みを一般共起語の対数尤度の２倍とし、直後の語の対数尤度に「２」を乗算する。 Thereafter, the creating unit 15 assigns a weight greater than the log likelihood of the general co-occurrence word to the log likelihood of the immediately subsequent word among the log likelihood of the immediately subsequent word and the log likelihood of the general co-occurrence word. Is registered in the co-occurrence dictionary storage unit 13 as a Korean co-occurrence dictionary (step SB56). As an example of such weighting, the creation unit 15 sets the log likelihood weight of the immediately following word to twice the log likelihood of the general co-occurrence word, as in the case of the Chinese co-occurrence dictionary, The log likelihood of the immediately following word is multiplied by “2”.

このように、上記のステップＳ５１〜ステップＳＢ５６までの処理により、図３Ｂに示した韓国語用の共起辞書が作成される。これによって、漢字表記の未知語が韓国語である確からしさを評価するための評価基準を定義することができる。なお、上記のステップＳＢ５３−１〜ＳＢ５５−１の処理と、上記のステップＳＢ５３−２〜ＳＢ５５−２の処理とは、両者を並列して実行することもできるし、いずれを先または後として処理を実行することとしてもかまわない。また、中国語用の共起辞書を作成するステップＳＡ５２〜ＳＡ５６の処理と、韓国語用の共起辞書を作成するステップＳＢ５２〜ＳＢ５６の処理とは、両者を並列して実行することもできるし、いずれを先または後として処理を実行することとしてもかまわない。 Thus, the Korean co-occurrence dictionary shown in FIG. 3B is created by the processing from step S51 to step SB56. This makes it possible to define an evaluation criterion for evaluating the probability that an unknown word in Kanji is Korean. In addition, the process of said step SB53-1-SB55-1 and the process of said step SB53-2-SB55-2 can also be performed in parallel, and either is processed before or after. It does not matter as executing. Further, the processes of steps SA52 to SA56 for creating a Chinese co-occurrence dictionary and the processes of steps SB52 to SB56 for creating a Korean co-occurrence dictionary can be executed in parallel. The process may be executed either before or after.

図１の説明に戻り、言語処理部１６は、受付部１１によって受け付けられた入力テキストに自然言語処理を実行する処理部である。この言語処理部１６は、図１に示すように、解析部１６ａと、抽出部１６ｂと、算出部１６ｃと、判別部１６ｄと、読み生成部１６ｅと、付与部１６ｆと、表音生成部１６ｇとをさらに有する。 Returning to the description of FIG. 1, the language processing unit 16 is a processing unit that executes natural language processing on the input text received by the reception unit 11. As shown in FIG. 1, the language processing unit 16 further includes an analysis unit 16a, an extraction unit 16b, a calculation unit 16c, a determination unit 16d, a reading generation unit 16e, an assigning unit 16f, and a phonetic generation unit 16g.

このうち、解析部１６ａは、形態素辞書記憶部１２を用いて、入力テキストに含まれる文章に形態素解析を実行することにより、文章を形態素に分割した上で各形態素に読みを付与する処理部である。一例として、受付部１１から「中国の成都を訪問する」という入力テキストが入力された場合を想定する。この場合に、解析部１６ａは、図２に示した形態素辞書から入力テキストに含まれる文字列と一致する形態素を検索して、入力テキストを「中国」、「の」、「成」、「都」、「を」、「訪問」、「する。」という形態素に分割する。このとき、解析部１６ａは、入力テキストに含まれる文字列のうち「成」及び「都」は実際には「成都」という２文字の単語であるが、形態素辞書には登録されてないので、これらの形態素が単漢字であると認識する。その上で、解析部１６ａは、各形態素に読みを付与し、「中国［チューゴク］・の［ノ］・成（単漢字）［セー］・都（単漢字）［ミヤコ］・を［オ］・訪問［ホウモン］・する［スル］。」という形態素解析の結果を得る。なお、上記の形態素解析には、既知のあらゆる形態素の解析手法を適用することができる。 Of these, the analysis unit 16a is a processing unit that uses the morpheme dictionary storage unit 12 to perform morphological analysis on the sentences in the input text, dividing each sentence into morphemes and assigning a reading to each morpheme. As an example, suppose the input text “中国の成都を訪問する” (“Visit Chengdu, China”) is input from the reception unit 11. In this case, the analysis unit 16a searches the morpheme dictionary shown in FIG. 2 for morphemes that match character strings in the input text and divides the input text into the morphemes “中国”, “の”, “成”, “都”, “を”, “訪問”, and “する。”. Although “成” and “都” in the input text actually form the two-character word “成都” (Chengdu), that word is not registered in the morpheme dictionary, so the analysis unit 16a recognizes these morphemes as single kanji. The analysis unit 16a then assigns a reading to each morpheme and obtains the morphological analysis result “中国 [チューゴク], の [ノ], 成 (single kanji) [セー], 都 (single kanji) [ミヤコ], を [オ], 訪問 [ホウモン], する [スル]。”. Any known morphological analysis technique can be applied to this analysis.
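
Since the text leaves the analysis technique open, the following sketch substitutes a simple greedy longest-match lookup to show how the example sentence ends up split into “成” and “都”. Greedy longest match is a stand-in chosen for brevity, not the method of the patent, and the names `MORPHEME_DICT` and `segment` are hypothetical. The dictionary entries and readings are the ones quoted in the example above.

```python
# Readings quoted in the example; "成都" is deliberately absent.
MORPHEME_DICT = {
    "中国": "チューゴク", "の": "ノ", "成": "セー", "都": "ミヤコ",
    "を": "オ", "訪問": "ホウモン", "する": "スル",
}

def segment(text, dictionary):
    """Greedy longest-match segmentation into (surface, reading) pairs.
    Characters with no dictionary entry get a reading of None."""
    max_len = max(len(key) for key in dictionary)
    result, pos = [], 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            surface = text[pos:pos + length]
            if surface in dictionary:
                result.append((surface, dictionary[surface]))
                pos += length
                break
        else:  # no entry of any length starts here
            result.append((text[pos], None))
            pos += 1
    return result

morphemes = segment("中国の成都を訪問する", MORPHEME_DICT)
print(morphemes)
```

Because “成都” has no entry, the lookup falls back to the single-kanji entries, which is what produces the incorrect reading “セーミヤコ” that the later units are designed to repair.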

抽出部１６ｂは、解析部１６ａによる形態素解析の結果から未知語を抽出する処理部である。一例としては、抽出部１６ｂは、入力テキストに含まれる形態素のうち、解析部１６ａによって単漢字であると解析された文字を漢字表記の未知語として抽出する。上記の入力テキストの例で言えば、抽出部１６ｂは、「成」及び「都」が単漢字として認識されているので、これら「成都」を漢字表記の未知語として抽出する。なお、「北京［ペキン］」や「上海（シャンハイ）」などのように、外来語として定着している単語については、形態素辞書に登録されているものとし、未知語として抽出されないものとする。 The extraction unit 16b is a processing unit that extracts unknown words from the result of the morphological analysis by the analysis unit 16a. As one example, the extraction unit 16b extracts, from the morphemes in the input text, the characters that the analysis unit 16a analyzed as single kanji, as an unknown word written in kanji. In the input text example above, “成” and “都” are recognized as single kanji, so the extraction unit 16b extracts “成都” as an unknown word written in kanji. Words that are established as loanwords, such as “北京 [ペキン]” (Beijing) and “上海 [シャンハイ]” (Shanghai), are assumed to be registered in the morpheme dictionary and are not extracted as unknown words.
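
The extraction could be sketched as below. The example in the text joins “成” and “都” into one unknown word; treating every run of adjacent single-kanji morphemes as one unknown word is an assumption drawn from that example, and the function name and the `(surface, is_single_kanji)` layout are hypothetical.

```python
def extract_unknown_words(morphemes):
    """Merge each run of adjacent single-kanji morphemes into one unknown
    word. Returns a list of (index of first morpheme, merged surface)."""
    unknown, run_start, run = [], None, ""
    for i, (surface, is_single_kanji) in enumerate(morphemes):
        if is_single_kanji:
            if run_start is None:
                run_start = i
            run += surface
        elif run_start is not None:
            unknown.append((run_start, run))
            run_start, run = None, ""
    if run_start is not None:  # a run that reaches the end of the sentence
        unknown.append((run_start, run))
    return unknown

analysis = [("中国", False), ("の", False), ("成", True), ("都", True),
            ("を", False), ("訪問", False), ("する", False)]
print(extract_unknown_words(analysis))
```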

算出部１６ｃは、共起辞書記憶部１３を用いて、抽出部１６ｂによって抽出された漢字表記の未知語が他国語である確からしさを表す他国語スコアを算出する処理部である。これを説明すると、算出部１６ｃは、入力テキストに含まれる形態素の中に共起辞書に登録されている共起語が存在するか否か、すなわち漢字表記の未知語が他国語の名詞と共起関係を持ち得るか否かを判定する。このとき、共起辞書に登録されている共起語が存在する場合には、算出部１６ｃは、共起語に対応付けられている対数尤度を用いて、漢字表記の未知語の他国語スコアを算出する。そして、算出部１６ｃは、入力テキストに含まれる形態素の中に共起辞書に登録されている共起語がなくなるまで、他国語スコアの算出を繰り返し、新たに算出した他国語スコアを前回までに累積加算していた他国語スコアにさらに累積加算する。なお、他国語スコアの算出は、中国語用の共起辞書および韓国語用の共起辞書ごと、すなわち他国語ごとに実行される。以下では、中国語に関する他国語スコアを「中国語スコア」と呼び、韓国語に関する他国語スコアを「韓国語スコア」と呼ぶ。 The calculation unit 16c is a processing unit that uses the co-occurrence dictionary storage unit 13 to calculate a foreign language score representing the likelihood that the unknown kanji word extracted by the extraction unit 16b belongs to a foreign language. Specifically, the calculation unit 16c determines whether the morphemes in the input text include a co-occurrence word registered in the co-occurrence dictionary, that is, whether the unknown kanji word can have a co-occurrence relationship with nouns of the foreign language. If a registered co-occurrence word is present, the calculation unit 16c calculates the foreign language score of the unknown kanji word using the log likelihood associated with that co-occurrence word. The calculation unit 16c repeats this calculation until no registered co-occurrence words remain among the morphemes in the input text, adding each newly calculated score to the foreign language score accumulated so far. The foreign language score is calculated separately for the Chinese co-occurrence dictionary and the Korean co-occurrence dictionary, that is, for each foreign language. Hereinafter, the foreign language score for Chinese is called the “Chinese score”, and the foreign language score for Korean is called the “Korean score”.

かかる他国語スコアは、一例として、算出式「他国語スコア＝共起語の対数尤度／未知語から形態素までの距離」を用いて算出される。ここで、上記の算出式において共起語の対数尤度を未知語から形態素までの距離で除すこととしたのは、未知語および形態素の距離が近いほど両者の相関性が強く、未知語が他国語である可能性がより高まるからである。なお、共起辞書に登録されている共起語が存在しない場合には、他国語スコアはゼロと算出されるものとする。また、共起辞書に一般共起語および直後の語の両方の共起語が存在する場合には、いずれか一方、例えば直後の語の対数尤度を他国語スコアに使用することとすればよい。 As an example, such a foreign language score is calculated using a calculation formula “foreign language score = log likelihood of co-occurrence word / distance from unknown word to morpheme”. Here, in the above formula, the log likelihood of the co-occurrence word is divided by the distance from the unknown word to the morpheme. The closer the distance between the unknown word and the morpheme, the stronger the correlation between the two. This is because there is a higher possibility that is a foreign language. If no co-occurrence word registered in the co-occurrence dictionary exists, the foreign language score is calculated as zero. Also, if there are co-occurrence words of both the general co-occurrence word and the immediately following word in the co-occurrence dictionary, for example, if the log likelihood of the immediately following word is used for the other language score Good.

上記のテキストの例で言えば、入力テキストに含まれる形態素の中に中国語用の共起辞書に一般共起語として登録されている「中国」が存在する。このため、算出部１６ｃは、上記の他国語スコアの算出式に、一般共起語「中国」に対応付けられている対数尤度「2.322」と、入力テキストに含まれる未知語「成都」から形態素「中国」までの距離「2」とを代入する。これによって、算出部１６ｃは、未知語「成都」の中国語スコア「1.161」を算出する。一方、入力テキストに含まれる形態素の中に韓国語用の共起辞書に登録されている一般共起語及び直後の語は存在しない。よって、算出部１６ｃは、未知語「成都」の韓国語スコア「0.000」を算出する。 In the example of the above text, “China” registered as a general co-occurrence word in the Chinese co-occurrence dictionary exists in the morphemes included in the input text. For this reason, the calculation unit 16c uses the log likelihood “2.322” associated with the general co-occurrence word “China” and the unknown word “Chengdu” included in the input text in the above calculation formula for the foreign language score. Substitute the distance “2” to the morpheme “China”. Thereby, the calculation unit 16c calculates the Chinese score “1.161” of the unknown word “Chengdu”. On the other hand, there is no general co-occurrence word or immediately following word registered in the Korean co-occurrence dictionary in the morphemes included in the input text. Therefore, the calculation unit 16c calculates the Korean score “0.000” of the unknown word “Chengdu”.

また、未知語から形態素までの距離は、入力テキストに含まれる形態素のうち任意の形態素を原点とし、他の形態素に座標を与えることにより算出できる。一例としては、文頭の形態素を原点とし、原点から形態素を１つ進むにつき座標の値を１つインクリメントすることにより、各形態素に座標位置を与えることができる。このようにして形態素に座標を与える場合には、未知語と形態素の位置関係によっては距離の値が負の値となってしまうので、絶対値を採るのが好ましい。上記のテキストの例で言えば、形態素「中国」の座標が「０」、未知語「成都」の座標が「２」であるので、算出部１６ｃは、未知語「成都」から形態素「中国」までの距離を｜２−０｜を計算することにより「２」と算出する。なお、距離の算出方法は、上記の方法に限定されず、未知語から目的の形態素までに到達するまでの形態素の数を計測することとしてもかまわない。 The distance from the unknown word to the morpheme can be calculated by giving an arbitrary morpheme among the morphemes included in the input text as an origin and giving coordinates to other morphemes. As an example, the coordinate position can be given to each morpheme by setting the morpheme at the beginning of the sentence as the origin and incrementing the coordinate value by one as the morpheme advances from the origin by one. When the coordinates are given to the morpheme in this way, the distance value becomes a negative value depending on the positional relationship between the unknown word and the morpheme, so it is preferable to take the absolute value. In the example of the above text, since the coordinate of the morpheme “China” is “0” and the coordinate of the unknown word “Chengdu” is “2”, the calculation unit 16c calculates the morpheme “China” from the unknown word “Chengdu”. Is calculated as “2” by calculating | 2-0 |. The method for calculating the distance is not limited to the above method, and the number of morphemes from the unknown word to the target morpheme may be measured.
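
The scoring formula and the coordinate-based distance can be combined in a few lines. This is a sketch under simplifying assumptions: the co-occurrence dictionary is flattened into a single word-to-log-likelihood mapping, the unknown word “成都” is treated as one morpheme at coordinate 2, and the function name is hypothetical. The log likelihood values are the ones quoted in the text.

```python
def foreign_language_score(morphemes, unknown_index, cooccurrence_ll):
    """Sum log likelihood / distance over every morpheme that is a
    registered co-occurrence word; 0.0 if none is present."""
    score = 0.0
    for i, surface in enumerate(morphemes):
        if i == unknown_index:
            continue
        ll = cooccurrence_ll.get(surface)
        if ll is not None:
            # Coordinates count morphemes from the sentence head;
            # the absolute value keeps the distance positive.
            score += ll / abs(unknown_index - i)
    return score

sentence = ["中国", "の", "成都", "を", "訪問", "する"]
chinese_score = foreign_language_score(sentence, 2, {"中国": 2.322})
korean_score = foreign_language_score(
    sentence, 2, {"韓国": 2.401, "北朝鮮": 2.521, "来日": 1.312})
print(chinese_score, korean_score)
```

For the example sentence this gives 2.322 / |2 − 0| = 1.161 for Chinese and 0.0 for Korean, matching the values stated in the text.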

判別部１６ｄは、算出部１６ｃによって算出された他国語スコアに基づいて、漢字表記の未知語がいずれの国の単語であるのかを判別する処理部である。これを説明すると、判別部１６ｄは、中国語スコアと韓国語スコアとを比較する。このとき、中国語スコアの方が韓国語スコアよりも高い場合には、判別部１６ｄは、中国語スコアと所定の閾値、例えば「1.000」とをさらに比較する。そして、中国語スコアが閾値よりも高い場合には、漢字表記の未知語が「中国語」であると判別できる。一方、中国語スコアが閾値以下である場合には、漢字表記の未知語が「日本語」であると判別できる。また、韓国語スコアが中国語スコア以上である場合には、判別部１６ｄは、韓国語スコアと所定の閾値、例えば「1.000」とを比較する。そして、韓国語スコアが閾値よりも高い場合には、漢字表記の未知語が「韓国語」であると判別できる。一方、韓国語スコアが閾値以下である場合には、漢字表記の未知語が「日本語」であると判別できる。 The discriminating unit 16d is a processing unit that discriminates which country the unknown word of kanji is based on, based on the foreign language score calculated by the calculating unit 16c. Explaining this, the determination unit 16d compares the Chinese score with the Korean score. At this time, if the Chinese score is higher than the Korean score, the determination unit 16d further compares the Chinese score with a predetermined threshold, for example, “1.000”. If the Chinese score is higher than the threshold, it can be determined that the unknown word in Chinese characters is “Chinese”. On the other hand, if the Chinese score is less than or equal to the threshold value, it can be determined that the unknown word in Chinese characters is “Japanese”. If the Korean score is greater than or equal to the Chinese score, the determination unit 16d compares the Korean score with a predetermined threshold, for example, “1.000”. If the Korean score is higher than the threshold, it can be determined that the unknown word in Kanji is “Korean”. On the other hand, if the Korean score is equal to or less than the threshold, it can be determined that the unknown word in Kanji notation is “Japanese”.

上記の入力テキストの例で言えば、中国語スコアが「1.161」であり、韓国語スコアが「0.000」であり、閾値が「1.000」である。このため、中国語スコア＞閾値＞韓国語スコアとなるので、未知語「成都」は中国語であると判別できる。 In the example of the input text above, the Chinese score is “1.161”, the Korean score is “0.000”, and the threshold is “1.000”. Therefore, since Chinese score> threshold> Korean score, it can be determined that the unknown word “Chengdu” is in Chinese.

なお、中国語スコア及び韓国語スコアと比較する閾値は、上記の値に限定されず、任意の値を採用できる。一例としては、入力テキストに付与されているタイトルが中国語用または韓国語用の共起辞書に登録されている場合には、中国語または韓国語に関する文章が入力される可能性が高くなるので、登録されていない場合よりも該当国の閾値を下げることもできる。また、中国語スコアと韓国語スコアとの間で異なる閾値を採用できるのも言うまでもない。 In addition, the threshold value compared with a Chinese score and a Korean score is not limited to said value, Arbitrary values are employable. As an example, if the title given to the input text is registered in the Chinese or Korean co-occurrence dictionary, there is a high possibility that sentences related to Chinese or Korean will be entered. The threshold of the corresponding country can be lowered as compared with the case where it is not registered. It goes without saying that different threshold values can be adopted between the Chinese score and the Korean score.
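
The decision rule of the determination unit 16d, including the option of separate thresholds per language, can be summarized as follows. The function name and the returned labels are hypothetical; the comparisons follow the order described in the text, and the default threshold is the example value “1.000”.

```python
def classify_unknown_word(chinese_score, korean_score,
                          chinese_threshold=1.0, korean_threshold=1.0):
    """Return "Chinese", "Korean", or "Japanese" for an unknown kanji word."""
    if chinese_score > korean_score:
        return "Chinese" if chinese_score > chinese_threshold else "Japanese"
    # Korean score is greater than or equal to the Chinese score.
    return "Korean" if korean_score > korean_threshold else "Japanese"

print(classify_unknown_word(1.161, 0.0))
```

Note that a tie between the two scores is resolved on the Korean branch, as in the text, and that a score exactly equal to its threshold yields “Japanese”.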

読み生成部１６ｅは、判別部１６ｄによる判別結果に応じて、漢字表記の未知語の読みを生成する処理部である。これを説明すると、読み生成部１６ｅは、「中国語スコア＞韓国語スコア」かつ「中国語スコア＞閾値」である場合、すなわち未知語が中国語であると判別された場合に、「中国語読みモード」がＯＦＦであるか否かをさらに判定する。このとき、「中国語読みモード」がＯＦＦである場合には、読み生成部１６ｅは、日本語辞書記憶部１４ａを用いて、未知語に対応する単漢字ごとに日本語の音読みを生成する。一方、「中国語読みモード」がＯＮである場合には、読み生成部１６ｅは、他国語辞書記憶部１４ｂに記憶された中国語用の漢字読み辞書を用いて、未知語に中国語読みを生成する。 The reading generation unit 16e is a processing unit that generates a reading of an unknown word in Chinese characters according to the determination result by the determination unit 16d. To explain this, the reading generation unit 16e determines that “Chinese score> Korean score” and “Chinese score> threshold”, that is, if it is determined that the unknown word is Chinese, It is further determined whether or not “reading mode” is OFF. At this time, when the “Chinese reading mode” is OFF, the reading generation unit 16e generates a Japanese phonetic reading for each single Chinese character corresponding to the unknown word, using the Japanese dictionary storage unit 14a. On the other hand, when the “Chinese reading mode” is ON, the reading generation unit 16e uses the Chinese kanji reading dictionary stored in the foreign language dictionary storage unit 14b to read Chinese readings into unknown words. Generate.

また、「韓国語スコア≧中国語スコア」かつ「韓国語スコア＞閾値」である場合、すなわち未知語が韓国語であると判別された場合に、読み生成部１６ｅは、他国語辞書記憶部１４ｂに記憶された韓国語用の漢字読み辞書を用いて、未知語に韓国語読みを生成する。また、「中国語スコア≦閾値」または「韓国語スコア≦閾値」である場合、すなわち未知語が日本語であると判別された場合に、読み生成部１６ｅは、未知語に新たな読みを生成せずに、解析部１６ａによる形態素解析の結果を付与部１６ｆへそのまま出力する。なお、抽出部１６ｂによって未知語が抽出されなかった場合にも、読み生成部１６ｅは、解析部１６ａによる形態素解析の結果を付与部１６ｆへそのまま出力する。 When “Korean score ≥ Chinese score” and “Korean score > threshold” both hold, that is, when the unknown word is determined to be Korean, the reading generation unit 16e generates a Korean reading for the unknown word using the kanji reading dictionary for Korean stored in the foreign language dictionary storage unit 14b. When “Chinese score ≤ threshold” or “Korean score ≤ threshold” holds, that is, when the unknown word is determined to be Japanese, the reading generation unit 16e does not generate a new reading for the unknown word and outputs the result of the morphological analysis by the analysis unit 16a to the assigning unit 16f as is. Likewise, when no unknown word is extracted by the extraction unit 16b, the reading generation unit 16e outputs the result of the morphological analysis by the analysis unit 16a to the assigning unit 16f as is.

上記のテキストの例で言えば、読み生成部１６ｅは、日本語辞書記憶部１４ａによって記憶された「成［セー］」及び「都［ト］」にしたがって漢字表記の未知語「成都」に読み「セート」を生成する。このため、誤った形態素解析の結果「セーミヤコ」が未知語「成都」に付与されることを防止できる。 In the text example above, the reading generation unit 16e generates the reading “セート” for the unknown kanji word “成都” according to “成 [セー]” and “都 [ト]” stored in the Japanese dictionary storage unit 14a. This prevents the erroneous morphological analysis result “セーミヤコ” from being assigned to the unknown word “成都”.
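
The branching of the reading generation unit 16e might be sketched as follows. The per-character dictionary layout, the function and parameter names, and the language labels are assumptions for illustration; the contents of the Chinese and Korean kanji reading dictionaries are not given in this section, so empty tables stand in for them. Only the on-yomi entries “成 [セー]” and “都 [ト]” come from the text.

```python
def generate_reading(unknown_word, language, japanese_onyomi,
                     chinese_readings, korean_readings,
                     chinese_reading_mode=False):
    """Return a new reading for the unknown word, or None when the word is
    judged Japanese and the morphological analysis result is kept."""
    if language == "Japanese":
        return None
    if language == "Chinese":
        # With the Chinese reading mode OFF, fall back to Japanese on-yomi.
        table = chinese_readings if chinese_reading_mode else japanese_onyomi
    elif language == "Korean":
        table = korean_readings
    else:
        raise ValueError(f"unexpected language: {language}")
    return "".join(table[ch] for ch in unknown_word)

onyomi = {"成": "セー", "都": "ト"}
print(generate_reading("成都", "Chinese", onyomi, {}, {}))
```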

付与部１６ｆは、読み生成部１６ｅによって生成された漢字表記の未知語の読みにアクセント（accent）を付与する処理部である。一例としては、付与部１６ｆは、漢字表記の未知語の後ろから２文字目の第１モーラ（mora）、すなわち１短音節にアクセントを付与する。 The assigning unit 16f is a processing unit that assigns an accent to the reading of an unknown word in kanji notation generated by the reading generating unit 16e. As an example, the assigning unit 16f assigns an accent to the first mora, that is, one short syllable of the second character from the back of the unknown word in Chinese characters.

図７は、アクセント付与の一例を示す図である。図７に示すように、未知語が「李鵬」である場合には、付与部１６ｆは、後ろから２文字目の「李［リ］」にアクセントを付与し、［リ’ホウ］とする。また、未知語が「劉備」である場合には、付与部１６ｆは、後ろから２文字目の「劉［リュウ］」の第１モーラ「リュ」にアクセントを付与し、［リュ’ウビ］とする。また、未知語が「毛沢東」である場合には、付与部１６ｆは、後ろから２文字目の「沢［タク］」の第１モーラ「タ」にアクセントを付与し、［モウタ’クトウ］とする。また、未知語が「胡錦濤」である場合には、付与部１６ｆは、後ろから２文字目の「錦［キン］」の第１モーラ「キ」にアクセントを付与し、［コキ’ントウ］とする。また、未知語が「青椒肉絲」である場合には、付与部１６ｆは、後ろから２文字目の「肉［ロウ］」の第１モーラ「ロ」にアクセントを付与し、［チンジャオロ’ウス］とする。 FIG. 7 is a diagram showing an example of accent assignment. As shown in FIG. 7, when the unknown word is “李鵬”, the assigning unit 16f assigns an accent to “李 [リ]”, the second character from the end, giving [リ’ホウ]. When the unknown word is “劉備”, the assigning unit 16f assigns an accent to the first mora “リュ” of “劉 [リュウ]”, the second character from the end, giving [リュ’ウビ]. When the unknown word is “毛沢東”, the assigning unit 16f assigns an accent to the first mora “タ” of “沢 [タク]”, the second character from the end, giving [モウタ’クトウ]. When the unknown word is “胡錦濤”, the assigning unit 16f assigns an accent to the first mora “キ” of “錦 [キン]”, the second character from the end, giving [コキ’ントウ]. When the unknown word is “青椒肉絲”, the assigning unit 16f assigns an accent to the first mora “ロ” of “肉 [ロウ]”, the second character from the end, giving [チンジャオロ’ウス].
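
The accent rule illustrated in FIG. 7 can be sketched as below. The input format (one katakana reading per kanji), the function names, the use of an ASCII apostrophe as the accent mark, and the rule that a small kana such as “ュ” belongs to the preceding mora are assumptions made for this illustration. The readings are the ones quoted in the text.

```python
SMALL_KANA = set("ァィゥェォャュョヮ")

def first_mora_length(reading):
    """A mora is one kana, plus a following small kana if there is one."""
    return 2 if len(reading) > 1 and reading[1] in SMALL_KANA else 1

def add_accent(char_readings, mark="'"):
    """Place the accent mark after the first mora of the second character
    from the end. char_readings holds one reading per kanji."""
    if len(char_readings) < 2:
        raise ValueError("the rule needs at least two characters")
    idx = len(char_readings) - 2
    target = char_readings[idx]
    n = first_mora_length(target)
    parts = list(char_readings)
    parts[idx] = target[:n] + mark + target[n:]
    return "".join(parts)

print(add_accent(["リュウ", "ビ"]))
```

Words of a single character fall outside the examples in FIG. 7, so the sketch rejects them rather than guessing a rule.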

表音生成部１６ｇは、付与部１６ｆによって漢字表記の未知語の読みにアクセントが付与された形態素解析の結果から入力テキストの表音文字列を生成する処理部である。一例として、表音生成部１６ｇは、漢字表記の未知語の読みにアクセントが付与された漢字かな混じり文から表音文字であるカタカナの文字列を生成する。なお、抽出部１６ｂによって未知語が抽出されなかった場合、あるいは判別部１６ｄによって未知語が日本語であると判別された場合には、解析部１６ａによる形態素解析の結果から表音文字列が生成される。 The phonetic generation unit 16g is a processing unit that generates a phonetic character string of the input text from the result of the morphological analysis in which an accent is added to the reading of an unknown word written in Chinese characters by the adding unit 16f. As an example, the phonetic generation unit 16g generates a katakana character string, which is a phonetic character, from a kanji-kana mixed sentence in which an accent is added to an unknown word reading in kanji notation. If an unknown word is not extracted by the extraction unit 16b, or if the unknown word is determined to be Japanese by the determination unit 16d, a phonetic character string is generated from the result of the morphological analysis by the analysis unit 16a. Is done.

韻律生成部１７は、表音生成部１６ｇによって生成された表音文字列に基づいて入力テキストに対応する韻律を生成する処理部である。ここで言う「韻律」は、ポーズ（pause）、音素の長さやイントネーション（intonation）などの喋り方の特徴の総称である。一態様としては、韻律生成部１７は、後述の合成部１８に合成させる音声、すなわち合成音声の個々の音素の長さである音素時間長や声の高さの変化パターンであるピッチパターン（pitch pattern）などの韻律を生成する。 The prosody generation unit 17 is a processing unit that generates a prosody corresponding to the input text based on the phonetic character string generated by the phonetic generation unit 16g. The “prosody” mentioned here is a general term for the characteristics of how to speak, such as pause, phoneme length and intonation. As one aspect, the prosody generation unit 17 synthesizes a voice to be synthesized by the synthesis unit 18 to be described later, that is, a pitch pattern (pitch) which is a phoneme time length that is the length of each phoneme of the synthesized speech or a pitch change pattern. pattern) and the like.

合成部１８は、韻律生成部１７によって生成された韻律から音声波形を生成して音声を合成する処理部である。一態様としては、合成部１８は、韻律生成部１７によって生成された韻律、例えば音素時間長やピッチパターンにしたがって音声波形を生成することにより音声を人工的に合成する。 The synthesis unit 18 is a processing unit that generates a speech waveform from the prosody generated by the prosody generation unit 17 and synthesizes speech. As one aspect, the synthesizer 18 artificially synthesizes speech by generating a speech waveform according to the prosody generated by the prosody generator 17, for example, phoneme time length or pitch pattern.

出力部１９は、音声を出力する出力部である。一例として、出力部１９は、合成部１８から入力される合成音声を出力する。かかる出力部１９の一態様としては、スピーカー（speaker）などが挙げられる。 The output unit 19 is an output unit that outputs sound. As an example, the output unit 19 outputs the synthesized speech input from the synthesis unit 18. An example of the output unit 19 includes a speaker.

なお、図１に示した受付部１１、作成部１５、言語処理部１６、韻律生成部１７及び合成部１８には、各種の集積回路や電子回路を採用できる。また、言語処理部１６に含まれる機能部の一部を別の集積回路や電子回路とすることもできる。例えば、集積回路としては、ＡＳＩＣ（Application Specific Integrated Circuit）やＦＰＧＡ（Field Programmable Gate Array）が挙げられる。また、電子回路としては、ＣＰＵ（Central Processing Unit）やＭＰＵ（Micro Processing Unit）などが挙げられる。 Note that various types of integrated circuits and electronic circuits can be employed for the reception unit 11, the creation unit 15, the language processing unit 16, the prosody generation unit 17, and the synthesis unit 18 illustrated in FIG. Further, a part of the functional unit included in the language processing unit 16 may be another integrated circuit or an electronic circuit. For example, examples of the integrated circuit include ASIC (Application Specific Integrated Circuit) and FPGA (Field Programmable Gate Array). Examples of the electronic circuit include a central processing unit (CPU) and a micro processing unit (MPU).

The following hardware can be used for the morpheme dictionary storage unit 12, the co-occurrence dictionary storage unit 13, the Japanese dictionary storage unit 14a, the foreign language dictionary storage unit 14b, and the corpus storage unit 15a shown in FIG. 1. As an example, a semiconductor memory element such as a RAM (Random Access Memory), a ROM (Read Only Memory), or a flash memory can be employed. A storage device such as a hard disk or an optical disk may also be employed for these five storage units.

[Process flow]
Next, the flow of processing of the speech synthesizer according to the present embodiment will be described. Here, (1) the reading generation process, which generates a reading for an unknown word written in kanji, is described first, followed by (2) the other language score calculation process, which calculates the other language score.

(1) Reading generation process
FIG. 8 is a flowchart illustrating the procedure of the reading generation process according to the first embodiment. This process is started when the reception unit 11 inputs one sentence of input text to the language processing unit 16.

As shown in FIG. 8, the analysis unit 16a uses the morpheme dictionary storage unit 12 to perform morphological analysis on the sentence included in the input text (step S101). Subsequently, the extraction unit 16b extracts, from the morphemes included in the input text, the characters that the analysis unit 16a analyzed as single kanji, as an unknown word written in kanji (step S102).

At this time, if no unknown word is extracted (No at step S102), the reading generation unit 16e outputs the result of the morphological analysis by the analysis unit 16a to the adding unit 16f as it is (step S103), and the process ends.

On the other hand, if an unknown word is extracted (Yes at step S102), the calculation unit 16c uses the co-occurrence dictionary storage unit 13 to execute the "other language score calculation process", which calculates other language scores such as a Chinese score and a Korean score (step S104).

Thereafter, the determination unit 16d compares the Chinese score with the Korean score (step S105). If the Chinese score is higher than the Korean score (Yes at step S105), the determination unit 16d further compares the Chinese score with a predetermined threshold, for example, "1.000" (step S106).

If the Chinese score is higher than the threshold (Yes at step S106), the unknown word written in kanji can be determined to be "Chinese". In this case, the reading generation unit 16e further determines whether the "Chinese reading mode" is OFF (step S107).

If the "Chinese reading mode" is OFF (Yes at step S107), the reading generation unit 16e uses the Japanese dictionary storage unit 14a to generate a Japanese on'yomi (Sino-Japanese reading) for each single kanji of the unknown word (step S109), and the process ends.

On the other hand, if the "Chinese reading mode" is ON (No at step S107), the reading generation unit 16e generates a Chinese reading for the unknown word using the kanji reading dictionary for Chinese stored in the foreign language dictionary storage unit 14b (step S108).

If the Chinese score is less than or equal to the threshold (No at step S106), the unknown word written in kanji can be determined to be "Japanese". In this case, the reading generation unit 16e does not generate a new reading for the unknown word, outputs the result of the morphological analysis by the analysis unit 16a to the adding unit 16f as it is (step S103), and the process ends.

If the Korean score is greater than or equal to the Chinese score (No at step S105), the determination unit 16d compares the Korean score with a predetermined threshold, for example, "1.000" (step S110).

If the Korean score is higher than the threshold (Yes at step S110), the unknown word written in kanji can be determined to be "Korean". In this case, the reading generation unit 16e generates a Korean reading for the unknown word using the kanji reading dictionary for Korean stored in the foreign language dictionary storage unit 14b (step S111), and the process ends.

On the other hand, if the Korean score is less than or equal to the threshold (No at step S110), the unknown word written in kanji can be determined to be "Japanese". In this case, the reading generation unit 16e does not generate a new reading for the unknown word, outputs the result of the morphological analysis by the analysis unit 16a to the adding unit 16f as it is (step S103), and the process ends.
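The branching of steps S105 to S111 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name `choose_reading`, the `readers` mapping, and the placeholder Chinese and on'yomi readers are assumptions introduced here, and the two scores are taken as already computed in step S104. A return value of `None` stands for step S103, in which the morphological analysis result is passed on unchanged.

```python
from typing import Callable, Dict, Optional, Tuple

Reader = Callable[[str], str]  # hypothetical: maps a kanji word to a reading


def choose_reading(
    unknown_word: str,
    chinese_score: float,
    korean_score: float,
    readers: Dict[str, Reader],
    threshold: float = 1.000,
    chinese_reading_mode: bool = True,
) -> Optional[Tuple[str, str]]:
    """Return (reading type, reading), or None to keep the analysis result."""
    if chinese_score > korean_score:                  # S105: Yes
        if chinese_score <= threshold:                # S106: No
            return None                               # S103
        if not chinese_reading_mode:                  # S107: Yes (mode is OFF)
            return "japanese_on", readers["japanese_on"](unknown_word)  # S109
        return "chinese", readers["chinese"](unknown_word)              # S108
    if korean_score > threshold:                      # S105: No, S110: Yes
        return "korean", readers["korean"](unknown_word)                # S111
    return None                                       # S110: No, then S103


# Toy readers. Only the Korean entries come from the description
# (釜 [プ], 山 [サン]); the other two are placeholders, not real dictionaries.
KOREAN_KANJI_READINGS = {"釜": "プ", "山": "サン"}
readers: Dict[str, Reader] = {
    "korean": lambda w: "".join(KOREAN_KANJI_READINGS[c] for c in w),
    "chinese": lambda w: f"<zh:{w}>",
    "japanese_on": lambda w: f"<on:{w}>",
}

result = choose_reading("釜山", 1.618, 2.695, readers, threshold=2.000)
```

Because the tie case (equal scores) falls into the Korean branch, the sketch mirrors the "greater than or equal to" wording of step S105 rather than treating ties separately.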

(2) Other language score calculation process
FIG. 9 is a flowchart illustrating the procedure of the Chinese score calculation process according to the first embodiment. This Chinese score calculation process corresponds to step S104 shown in FIG. 8 and is started when the extraction unit 16b extracts an unknown word written in kanji. Although the calculation of the Chinese score is illustrated here, the Korean score is calculated in the same way.

As illustrated in FIG. 9, the calculation unit 16c sets various parameters (step S301). As an example, the calculation unit 16c initializes the Chinese score and the loop counter I to zero, and sets the position J of the unknown word and the number N of morphemes in the sentence.

Then, the calculation unit 16c repeatedly executes the processing of steps S303 to S307 below until the loop counter I becomes equal to the number N of words in the sentence (Yes at step S302).

That is, the calculation unit 16c determines whether the I-th morpheme is registered in the co-occurrence dictionary for Chinese, in other words, whether the unknown word written in kanji can have a co-occurrence relationship with a Chinese noun (step S303).

If the I-th morpheme is registered in the co-occurrence dictionary for Chinese (Yes at step S303), the calculation unit 16c extracts the log likelihood corresponding to the I-th morpheme from the co-occurrence dictionary for Chinese (step S304). The calculation unit 16c then calculates the distance L from the unknown word to the morpheme as the absolute value of the difference between the loop counter I and the position J of the unknown word, that is, |I − J| (step S305).

Subsequently, the calculation unit 16c calculates the Chinese score of the I-th morpheme using the formula "Chinese score = log likelihood corresponding to the I-th morpheme / distance L from the unknown word to the morpheme", and adds it to the Chinese score accumulated up to the (I−1)-th morpheme (step S306).

Then, the calculation unit 16c increments the loop counter I (step S307) and returns to step S302. If the I-th morpheme is not registered in the co-occurrence dictionary for Chinese (No at step S303), the processing of steps S304 to S306 is skipped and the loop counter I is incremented (step S307).

Thereafter, when the loop counter I becomes equal to the number N of words in the sentence (No at step S302), the Chinese score has been calculated for all the morphemes of the input text, and the process ends.
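The loop of steps S301 to S307 can be sketched as follows. This is a minimal sketch under stated assumptions rather than the implementation of the embodiment: the function name `other_language_score`, the token list, and the dictionary values are hypothetical, positions are counted from zero, and the unknown word's own position is skipped explicitly, since the distance |I − J| would be zero there and the description does not say how that case is handled.

```python
from typing import Dict, List


def other_language_score(
    morphemes: List[str],
    unknown_pos: int,
    cooccurrence: Dict[str, float],
) -> float:
    """Accumulate log likelihood / distance over the morphemes of a sentence.

    cooccurrence maps a morpheme to its log likelihood in the co-occurrence
    dictionary of one language (for example, Chinese).
    """
    score = 0.0                                    # S301: initialize the score
    for i, morpheme in enumerate(morphemes):       # S302 / S307: loop over I
        if i == unknown_pos:
            continue  # assumption: skip the unknown word itself (distance 0)
        if morpheme not in cooccurrence:           # S303: No
            continue
        log_likelihood = cooccurrence[morpheme]    # S304
        distance = abs(i - unknown_pos)            # S305: L = |I - J|
        score += log_likelihood / distance         # S306: cumulative addition
    return score


# Hypothetical sentence of four morphemes with the unknown word at position 1.
tokens = ["w0", "UNK", "w2", "w3"]
toy_dictionary = {"w0": 2.0, "w3": 3.0}  # made-up log likelihoods
score = other_language_score(tokens, 1, toy_dictionary)  # 2.0 / 1 + 3.0 / 2
```

Dividing by the distance means that a registered morpheme adjacent to the unknown word contributes its full log likelihood, while one further away contributes proportionally less.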

[Effects of Embodiment 1]
As described above, the speech synthesizer 10 according to the present embodiment performs morphological analysis on a sentence containing kanji using the morpheme dictionary, thereby dividing the sentence into morphemes and assigning a reading to each morpheme. The speech synthesizer 10 then extracts an unknown word written in kanji from the result of the morphological analysis, calculates an other language score representing the likelihood that the unknown word belongs to another language, determines from the other language score which country's language the unknown word belongs to, and generates a reading for the unknown word according to the determination result.

Therefore, the speech synthesizer 10 according to the present embodiment can evaluate, as the other language score, the likelihood that an unknown word written in kanji belongs to another language, and can then generate the reading in the language with the high other language score as the reading of the unknown word. Consequently, even if Japanese text contains words of other languages written in kanji, such as Chinese, Korean, or Taiwanese, the speech synthesizer 10 can prevent an incorrect Japanese reading from being assigned to the character string of the other language. The speech synthesizer 10 according to the present embodiment thus makes it possible to read aloud accurately the other-language words written in kanji that are included in text.

The speech synthesizer 10 according to the present embodiment also calculates the other language score using the co-occurrence relationship between the morphemes other than the unknown word in the sentence and nouns of the other language. The speech synthesizer 10 can therefore use the other language score to evaluate whether the unknown word written in kanji can have a co-occurrence relationship with nouns of the other language.

Furthermore, the speech synthesizer 10 according to the present embodiment assigns an accent to the reading of the unknown word written in kanji, and generates the phonetic character string of the input text from the morphological analysis result in which the accent has been assigned to that reading. The speech synthesizer 10 can therefore generate the phonetic character string with an appropriate accent assigned in addition to the reading of the unknown word, which makes it possible to read aloud the other-language words written in kanji more accurately.

The first embodiment above illustrated the case where the other language score is calculated using a co-occurrence dictionary; however, the disclosed apparatus is not limited to this, and the other language score can also be calculated by other methods. The second embodiment therefore describes the case where the other language score is calculated using a single kanji dictionary that defines the likelihood that each single kanji belongs to another language.

FIG. 10 is a block diagram illustrating the configuration of the speech synthesizer according to the second embodiment. In this embodiment, functional units having the same functions as those of the speech synthesizer 10 according to the first embodiment are given the same reference numerals, and their description is omitted.

The speech synthesizer 30 shown in FIG. 10 differs from the speech synthesizer 10 shown in FIG. 1 in that it has a single kanji dictionary storage unit 31, a foreign language noun storage unit 32a, a creation unit 32, and a language processing unit 33 in place of the co-occurrence dictionary storage unit 13, the corpus storage unit 15a, the creation unit 15, and the language processing unit 16. In addition, the processing performed by the calculation unit 33a shown in FIG. 10 differs from that of the calculation unit 16c shown in FIG. 1.

Of these, the single kanji dictionary storage unit 31 is a storage unit that stores a single kanji dictionary defining the other-language likeness of each single kanji. As one aspect, the single kanji dictionary storage unit 31 stores, for each single kanji, its Chinese-likeness and Korean-likeness in association with the kanji. The "Chinese-likeness" here refers to the likelihood that the single kanji is Chinese, and is calculated by obtaining a log likelihood from the frequency with which the single kanji appears together with Chinese nouns. Similarly, the "Korean-likeness" refers to the likelihood that the single kanji is Korean, and is calculated by obtaining a log likelihood from the frequency with which the single kanji appears together with Korean nouns. In the single kanji dictionary, the other-language likeness associated with each single kanji forming the unknown word extracted by the extraction unit 16b is used by the calculation unit 33a, described later, to calculate the other language score.

FIG. 11 is a diagram illustrating a configuration example of the information stored in the single kanji dictionary storage unit. In the example of FIG. 11, the single kanji "金" has a Chinese-likeness of "1.534" and a Korean-likeness of "4.126". The single kanji "釜" has a Chinese-likeness of "1.478" and a Korean-likeness of "3.632". The single kanji "山" has a Chinese-likeness of "1.759" and a Korean-likeness of "1.758". The single kanji "李" has a Chinese-likeness of "3.335" and a Korean-likeness of "3.411". The Chinese-likeness and Korean-likeness of the single kanji "成", "都", "劉", and "京" are defined in the same way.

The foreign language noun storage unit 32a is a storage unit that stores noun lists of other languages. As one aspect, the foreign language noun storage unit 32a stores a Chinese noun list and a Korean noun list. These noun lists are used when the creation unit 32, described later, creates the single kanji dictionary. Accordingly, the larger the amount of text prepared as the noun lists, the higher the accuracy of the other language score calculated by the calculation unit 33a described later.

The creation unit 32 is a processing unit that creates the single kanji dictionary using the foreign language noun storage unit 32a. The method of creating the single kanji dictionary is described with reference to FIG. 12, which is a diagram for explaining that method. Although the example of FIG. 12 defines the Chinese-likeness and Korean-likeness of single kanji, a likeness can be defined in the same way for any language that uses kanji; for Taiwanese, for example, a Taiwanese-likeness can be defined.

As shown in FIG. 12, the creation unit 32 reads the Chinese noun list from the noun lists stored in the foreign language noun storage unit 32a and concatenates the Chinese nouns in the list into a single kanji string (step SA71). In the example of FIG. 12, the creation unit 32 concatenates the Chinese nouns "北京" (Beijing), "南京" (Nanjing), "劉備" (Liu Bei), and so on into one kanji string of Chinese nouns, "北京南京劉備…".

Then, the creation unit 32 calculates the frequency with which each kanji included in the kanji string of Chinese nouns appears in that string, hereinafter referred to as the "appearance frequency" (step SA72). In the example of FIG. 12, the appearance frequency is calculated as "37 (times)" for the kanji "劉", "35 (times)" for "京", "32 (times)" for "李", "26 (times)" for "都", and "25 (times)" for "成".

Then, the creation unit 32 calculates the log likelihood of each kanji (step SA73). As an example, the formula "log likelihood = log(appearance frequency * 100% / length of kanji string)" is used. Any base can be used for the logarithm in this formula, but as an example, the natural logarithm is preferable. The "length of kanji string" in the formula refers to the total number of kanji in the kanji string of Chinese nouns. In the example of FIG. 12, the log likelihood is calculated as "3.502" for the kanji "劉", "3.466" for "京", "3.335" for "李", "2.213" for "都", and "2.086" for "成".

Meanwhile, the creation unit 32 performs the same processing on the Korean noun list as on the Chinese noun list. Specifically, the creation unit 32 reads the Korean noun list from the noun lists stored in the foreign language noun storage unit 32a and concatenates the Korean nouns in the list into a single kanji string (step SB71). In the example of FIG. 12, the creation unit 32 concatenates the Korean nouns "平壌" (Pyongyang), "金大中" (Kim Dae-jung), and so on into one kanji string of Korean nouns, "平壌金大中…".

Then, the creation unit 32 calculates the frequency with which each kanji included in the kanji string of Korean nouns appears in that string (step SB72). In the example of FIG. 12, the appearance frequency is calculated as "39 (times)" for the kanji "金", "37 (times)" for "釜", "33 (times)" for "李", and "22 (times)" for "山".

Then, the creation unit 32 calculates the log likelihood of each kanji (step SB73). In the example of FIG. 12, the log likelihood is calculated as "4.126" for the kanji "金", "3.632" for "釜", "3.411" for "李", and "1.758" for "山".

Thereafter, the creation unit 32 takes the log likelihood of each kanji calculated from the kanji string of Chinese nouns as its Chinese-likeness, and the log likelihood of each kanji calculated from the kanji string of Korean nouns as its Korean-likeness. The creation unit 32 then brings together the Chinese-likeness and Korean-likeness of each kanji and registers them in the single kanji dictionary storage unit 31 (step S74). For a kanji that appears in only one of the two kanji strings, the creation unit 32 assigns a minimum value, which may be zero, for example "0.500", as the missing Chinese-likeness or Korean-likeness before registering the kanji in the single kanji dictionary storage unit 31.

In this way, the single kanji dictionary shown in FIG. 11 is created by the processing of steps SA71 to SA73, steps SB71 to SB73, and step S74. As with the co-occurrence dictionary in the first embodiment, this defines an evaluation criterion for evaluating the likelihood that an unknown word written in kanji belongs to another language. The processing of steps SA71 to SA73 and the processing of steps SB71 to SB73 may be executed in parallel, or either may be executed before the other.
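The dictionary creation of steps SA71 to SA73, SB71 to SB73, and S74 can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the natural logarithm is chosen because the description names it as a preferable example, and the noun lists below contain just the few nouns quoted from FIG. 12, so the resulting values do not reproduce those of FIG. 11.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def kanji_log_likelihoods(nouns: Iterable[str]) -> Dict[str, float]:
    """Log likelihood of each kanji in one language's noun list."""
    kanji_string = "".join(nouns)        # SA71 / SB71: concatenate the nouns
    counts = Counter(kanji_string)       # SA72 / SB72: appearance frequency
    total = len(kanji_string)            # length of the kanji string
    # SA73 / SB73: log(appearance frequency * 100 / length of kanji string)
    return {k: math.log(c * 100 / total) for k, c in counts.items()}


def build_single_kanji_dictionary(
    chinese_nouns: Iterable[str],
    korean_nouns: Iterable[str],
    floor: float = 0.500,  # minimum value for a kanji missing from one list
) -> Dict[str, Tuple[float, float]]:
    """Map each kanji to (Chinese-likeness, Korean-likeness), as in S74."""
    zh = kanji_log_likelihoods(chinese_nouns)
    ko = kanji_log_likelihoods(korean_nouns)
    return {k: (zh.get(k, floor), ko.get(k, floor)) for k in zh.keys() | ko.keys()}


# Tiny noun lists built from the nouns quoted in the description.
toy_dictionary = build_single_kanji_dictionary(
    ["北京", "南京", "劉備"],
    ["平壌", "金大中"],
)
```

In this toy input "京" occurs twice in a six-character Chinese string and never in the Korean string, so it receives the log of 200/6 as its Chinese-likeness and the floor value as its Korean-likeness. Because the two passes share no state, they can run in either order or in parallel, as the description notes.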

Here, as an example, assume that the input text "釜山を訪問する" ("visit Busan") is input from the reception unit 11. In this case, the analysis unit 16a searches the morpheme dictionary shown in FIG. 2 for morphemes matching the character strings in the input text, and divides the input text into the morphemes "釜", "山", "を", "訪問", and "する。". Although "釜" and "山" in the input text actually form the two-character word "釜山 [プサン]" (Busan), this word is not registered in the morpheme dictionary, so the analysis unit 16a recognizes these morphemes as single kanji. The analysis unit 16a then assigns a reading to each morpheme and obtains the morphological analysis result "釜 (single kanji) [カマ] / 山 (single kanji) [ヤマ] / を [オ] / 訪問 [ホウモン] / する [スル]。". Subsequently, because "釜" and "山" are recognized as single kanji, the extraction unit 16b extracts "釜山" as an unknown word written in kanji.

The calculation unit 33a uses the single kanji dictionary storage unit 31 to calculate the other language score representing the likelihood that the unknown word written in kanji extracted by the extraction unit 16b belongs to another language. The method of calculating the Chinese score is illustrated here; the Korean score is calculated in the same way. Specifically, the calculation unit 33a extracts from the single kanji dictionary the Chinese-likeness corresponding to each single kanji constituting the unknown word, and calculates the Chinese score by summing these values. The calculation unit 33a then divides this total by the number of characters in the unknown word to obtain a normalized Chinese score. This normalization allows the determination unit 16d to determine which country's language the word belongs to with a single threshold, even though the number of characters in the unknown word varies from one input text to another.

In the example text above, the calculation unit 33a extracts from the single kanji dictionary the Chinese-likeness "1.478" of the single kanji "釜" and the Chinese-likeness "1.759" of the single kanji "山", which constitute the unknown word. The calculation unit 33a then sums the Chinese-likeness "1.478" of "釜" and the Chinese-likeness "1.759" of "山" and divides the total by "2", the number of characters in the unknown word "釜山". The calculation unit 33a thereby obtains the normalized Chinese score "1.618". Likewise, the calculation unit 33a extracts from the single kanji dictionary the Korean-likeness "3.632" of "釜" and the Korean-likeness "1.758" of "山". The calculation unit 33a then sums the Korean-likeness "3.632" of "釜" and the Korean-likeness "1.758" of "山" and divides the total by "2", the number of characters in the unknown word "釜山". The calculation unit 33a thereby obtains the normalized Korean score "2.695".

In this way, the calculation unit 33a calculates a Chinese score of "1.618" and a Korean score of "2.695". If the threshold used by the determination unit 16d is "2.000", then Korean score > threshold > Chinese score, so the determination unit 16d determines that the unknown word "釜山" is Korean. The reading generation unit 16e therefore generates the reading "プサン" (Busan) for the unknown word "釜山" in accordance with the entries "釜 [プ]" and "山 [サン]" stored in the foreign language dictionary storage unit 14b. This prevents the erroneous morphological analysis result "カマヤマ" from being assigned to the unknown word "釜山".
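The arithmetic of this example can be checked with a short Python sketch. The dictionary entries are the values quoted from FIG. 11; the function name `normalized_scores` and the fallback value used for a kanji that is absent from the dictionary are assumptions introduced here, since the description does not cover that case.

```python
from typing import Dict, Tuple

# (Chinese-likeness, Korean-likeness) per single kanji, as quoted from FIG. 11.
SINGLE_KANJI_DICT: Dict[str, Tuple[float, float]] = {
    "金": (1.534, 4.126),
    "釜": (1.478, 3.632),
    "山": (1.759, 1.758),
    "李": (3.335, 3.411),
}


def normalized_scores(
    unknown_word: str,
    dictionary: Dict[str, Tuple[float, float]],
    floor: float = 0.500,  # assumption: fallback for unregistered kanji
) -> Tuple[float, float]:
    """Return (Chinese score, Korean score), each divided by the word length."""
    if not unknown_word:
        raise ValueError("unknown_word must contain at least one character")
    entries = [dictionary.get(c, (floor, floor)) for c in unknown_word]
    chinese_total = sum(zh for zh, _ in entries)   # sum of Chinese-likeness
    korean_total = sum(ko for _, ko in entries)    # sum of Korean-likeness
    n = len(unknown_word)                          # number of characters
    return chinese_total / n, korean_total / n     # normalization


chinese_score, korean_score = normalized_scores("釜山", SINGLE_KANJI_DICT)

THRESHOLD = 2.000
is_korean = korean_score >= chinese_score and korean_score > THRESHOLD
```

For "釜山" the two averages come to about 1.6185 and 2.695, which agrees with the quoted scores "1.618" and "2.695" to within the last printed digit, and only the Korean score exceeds the threshold of 2.000.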

[Process flow]
FIG. 13 is a flowchart illustrating the procedure of the Chinese score calculation process according to the second embodiment. This process corresponds to step S104 shown in FIG. 8 and is started when the extraction unit 16b extracts an unknown word written in kanji. Although calculation of the Chinese score is illustrated here as one of the foreign language scores, the Korean score is calculated in the same way.

As illustrated in FIG. 13, the calculation unit 33a sets various parameters (step S501). For example, the calculation unit 33a initializes the Chinese score and the loop counter I to zero and sets the number of characters N of the unknown word.

The calculation unit 33a then repeats steps S503 to S505 below until the loop counter I becomes equal to the number of characters N of the unknown word (Yes at step S502).

Specifically, the calculation unit 33a extracts from the single-kanji dictionary the Chinese-likeness corresponding to the I-th kanji of the unknown word (step S503). The calculation unit 33a then adds the Chinese-likeness of the I-th kanji to the Chinese score accumulated up to the (I−1)-th kanji (step S504). The calculation unit 33a then increments the loop counter I (step S505) and returns to step S502.

When the loop counter I becomes equal to the number of characters N of the unknown word (No at step S502), the calculation unit 33a divides the Chinese score by the number of characters N (step S506). After normalizing the Chinese score in this way, the process ends.
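The flow of steps S501 to S506 can be sketched as a simple loop. This is a minimal Python illustration that follows the step order described above; the function name, the plain-dict representation of the single-kanji dictionary, and the error raised for an empty word are assumptions. The flowchart as described does not say what happens when a kanji is missing from the dictionary, so this sketch simply lets the lookup fail with a `KeyError`.

```python
def chinese_score(unknown_word, likelihood_table):
    """Normalized Chinese score of an unknown word, following FIG. 13.

    likelihood_table maps a single kanji to its Chinese-likeness
    (a hypothetical stand-in for the single-kanji dictionary).
    """
    if not unknown_word:
        raise ValueError("unknown word must contain at least one character")

    # Step S501: initialize the score and loop counter I, and set N.
    score = 0.0
    i = 0
    n = len(unknown_word)

    # Step S502: keep looping while I has not yet reached N.
    while i < n:
        # Step S503: look up the Chinese-likeness of the I-th kanji.
        value = likelihood_table[unknown_word[i]]
        # Step S504: add it to the score accumulated so far.
        score += value
        # Step S505: increment the loop counter.
        i += 1

    # Step S506: normalize by the number of characters.
    return score / n
```

The Korean score would be obtained from the same function by passing a table of Korean-likeness values instead.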

[Effects of the second embodiment]
As described above, the speech synthesizer 30 according to this embodiment uses foreign language scores to determine which language other than Japanese an unknown word written in kanji in the text belongs to, and then generates a reading for the unknown word. The speech synthesizer 30 can therefore evaluate, as a foreign language score, the likelihood that the unknown word written in kanji belongs to a foreign language, and generate the reading in a foreign language with a high foreign language score as the reading of the unknown word. As a result, even if Japanese text contains foreign words written in kanji, such as Chinese, Korean, or Taiwanese words, the speech synthesizer 30 avoids assigning an incorrect Japanese reading to the foreign character string. As in the first embodiment, the speech synthesizer 30 can thus accurately read aloud foreign words written in kanji that appear in the text.

Furthermore, the speech synthesizer 30 according to this embodiment calculates the foreign language score using the frequency, likelihood, or probability with which each single kanji making up the unknown word appears in words of the foreign language. The speech synthesizer 30 can therefore use the foreign language score to evaluate how likely the character composition of the unknown word is to be that of the foreign language.

Embodiments of the disclosed device have been described above, but the present invention may be implemented in various forms other than those embodiments. Other embodiments included in the present invention are therefore described below.

[Scope of application]
For example, the first and second embodiments illustrate the disclosed device implemented as the speech synthesizer 10 or 30, but the implementation of the disclosed device is not limited to this. For example, only the functions of the language processing unit 16 or 33 included in the speech synthesizer 10 or 30 may be provided as a language processing device.

In addition, although the first and second embodiments have been described as being implemented separately, they can be combined as appropriate. For example, statistical processing such as an arithmetic mean or a weighted average can be applied to the foreign language score calculated using the co-occurrence dictionary and the foreign language score calculated using the single-kanji dictionary, making the foreign language score more reliable.
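One way to combine the two scores is a weighted average, sketched below in Python. The function name, the parameter names, and the default weight are illustrative assumptions; the text does not specify particular weights, nor whether the two scores are on comparable scales.

```python
def combine_scores(cooccurrence_score, single_kanji_score, kanji_weight=0.5):
    """Weighted average of the two foreign language scores.

    kanji_weight is the weight given to the single-kanji score; the
    default of 0.5 reduces to the arithmetic mean. Both the parameter
    and its default are hypothetical.
    """
    if not 0.0 <= kanji_weight <= 1.0:
        raise ValueError("kanji_weight must be between 0 and 1")
    return (
        (1.0 - kanji_weight) * cooccurrence_score
        + kanji_weight * single_kanji_score
    )
```

With the default weight the function returns the arithmetic mean, so `combine_scores(1.0, 3.0)` gives 2.0, while `combine_scores(1.0, 3.0, 0.75)` leans toward the single-kanji score and gives 2.5.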

In addition, the components of each illustrated device need not be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to that illustrated, and all or part of it may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the reception unit 11, the creation unit 15, the language processing unit 16, the language processing unit 33, the prosody generation unit 17, or the synthesis unit 18 may be connected via a network as a device external to the speech synthesizer 10 or 30. Alternatively, separate devices may each have the reception unit 11, the creation unit 15, the language processing unit 16, the language processing unit 33, the prosody generation unit 17, or the synthesis unit 18, and may be connected over a network and cooperate to realize the functions of the speech synthesizer 10 or 30.

[Language processing program]
The various processes described in the above embodiments can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. An example of a computer that executes a language processing program having the same functions as the above embodiments is therefore described below with reference to FIG. 14.

FIG. 14 is a diagram illustrating an example of a computer that executes the language processing program according to the third embodiment. As illustrated in FIG. 14, the computer 100 according to the third embodiment includes an operation unit 110a, a microphone 110b, a speaker 110c, a display 120, and a communication unit 130. The computer 100 further includes a CPU 150, a ROM 160, an HDD (Hard Disk Drive) 170, and a RAM (Random Access Memory) 180. These units 110 to 180 are connected via a bus 140.

As illustrated in FIG. 14, the HDD 170 stores in advance a language processing program 170a that provides the same functions as the language processing unit 16 or 33 described in the first or second embodiment. Like the components of the language processing unit 16 or 33 shown in FIG. 1 or FIG. 10, the language processing program 170a may be integrated or separated as appropriate. That is, not all of the data need be stored in the HDD 170 at all times; only the data required for processing need be stored in the HDD 170.

The CPU 150 then reads the language processing program 170a from the HDD 170 and loads it into the RAM 180. As a result, as illustrated in FIG. 14, the language processing program 170a functions as a language processing process 180a. The language processing process 180a loads various data read from the HDD 170 into the area of the RAM 180 allocated to it as appropriate, and executes various processes based on the loaded data. The language processing process 180a includes, for example, the processing executed by the language processing unit 16 or 33 shown in FIG. 1 or FIG. 10, such as the processing shown in FIGS. 8 and 9 or in FIG. 13. Not all of the processing units virtually realized on the CPU 150 need operate on the CPU 150 at all times; only the processing units required for the processing need be virtually realized.

Note that the language processing program need not be stored in the HDD 170 or the ROM 160 from the outset. For example, each program may be stored on a "portable physical medium" inserted into the computer 100, such as a flexible disk (so-called FD), a CD-ROM, a DVD disc, a magneto-optical disk, or an IC card. The computer 100 may then acquire each program from such a portable physical medium and execute it. Alternatively, each program may be stored on another computer or server device connected to the computer 100 via a public line, the Internet, a LAN, a WAN, or the like, and the computer 100 may acquire each program from there and execute it.

The following appendices are further disclosed with respect to the embodiments, including the examples above.

(Appendix 1) A language processing device comprising:
an analysis unit that performs morphological analysis on a sentence containing kanji using a predetermined dictionary, thereby dividing the sentence into morphemes and assigning a reading to each morpheme;
an extraction unit that extracts an unknown word written in kanji from the result of the morphological analysis by the analysis unit;
a calculation unit that calculates a foreign language score representing the likelihood that the unknown word written in kanji extracted by the extraction unit belongs to a foreign language different from the national language covered by the dictionary;
a determination unit that determines, based on the foreign language score calculated by the calculation unit, which country's word the unknown word written in kanji is; and
a reading generation unit that generates a reading of the unknown word written in kanji according to the determination result of the determination unit.

(Appendix 2) The language processing device according to appendix 1, wherein the calculation unit calculates the foreign language score using a co-occurrence relationship between morphemes other than the unknown word contained in the sentence and words of the foreign language.

(Appendix 3) The language processing device according to appendix 1, wherein the calculation unit calculates the foreign language score using the frequency, likelihood, or probability with which each single kanji making up the unknown word written in kanji appears in words of the foreign language.

(Appendix 4) The language processing device according to appendix 1, 2, or 3, further comprising:
an assigning unit that assigns an accent to the reading of the unknown word written in kanji generated by the reading generation unit; and
a phonetic generation unit that generates a phonetic character string of the sentence from the result of the morphological analysis in which the assigning unit has assigned the accent to the reading of the unknown word written in kanji.

(Appendix 5) A speech synthesizer comprising:
an analysis unit that performs morphological analysis on a sentence containing kanji using a predetermined dictionary, thereby dividing the sentence into morphemes and assigning a reading to each morpheme;
an extraction unit that extracts an unknown word written in kanji from the result of the morphological analysis by the analysis unit;
a calculation unit that calculates a foreign language score representing the likelihood that the unknown word written in kanji extracted by the extraction unit belongs to a foreign language different from the national language covered by the dictionary;
a determination unit that determines, based on the foreign language score calculated by the calculation unit, which country's word the unknown word written in kanji is;
a reading generation unit that generates a reading of the unknown word written in kanji according to the determination result of the determination unit;
an assigning unit that assigns an accent to the reading of the unknown word written in kanji generated by the reading generation unit;
a phonetic generation unit that generates a phonetic character string of the sentence from the result of the morphological analysis in which the assigning unit has assigned the accent to the reading of the unknown word written in kanji;
a prosody generation unit that generates a prosody corresponding to the sentence based on the phonetic character string generated by the phonetic generation unit; and
a synthesis unit that generates a speech waveform from the prosody generated by the prosody generation unit and synthesizes speech.

(Appendix 6) A language processing method in which a computer executes processing comprising:
performing morphological analysis on a sentence containing kanji using a predetermined dictionary, thereby dividing the sentence into morphemes and assigning a reading to each morpheme;
extracting an unknown word written in kanji from the result of the morphological analysis;
calculating a foreign language score representing the likelihood that the unknown word written in kanji belongs to a foreign language different from the national language covered by the dictionary;
determining, based on the foreign language score, which country's word the unknown word written in kanji is; and
generating a reading of the unknown word written in kanji according to the determination result.

(Appendix 7) A language processing program that causes a computer to execute processing comprising:
performing morphological analysis on a sentence containing kanji using a predetermined dictionary, thereby dividing the sentence into morphemes and assigning a reading to each morpheme;
extracting an unknown word written in kanji from the result of the morphological analysis;
calculating a foreign language score representing the likelihood that the unknown word written in kanji belongs to a foreign language different from the national language covered by the dictionary;
determining, based on the foreign language score, which country's word the unknown word written in kanji is; and
generating a reading of the unknown word written in kanji according to the determination result.

(Appendix 8) A method for creating a co-occurrence dictionary in which a computer executes processing comprising:
performing morphological analysis, using a predetermined dictionary, on each of a plurality of sentences containing kanji, thereby dividing each sentence into morphemes;
extracting, from the sentences divided into morphemes, sentences that contain a foreign-language noun composed of kanji, the foreign language being different from the national language covered by the dictionary;
calculating the frequency with which a first morpheme appears in the plurality of sentences, the first morpheme being a morpheme of a sentence containing the foreign-language noun that immediately follows the foreign-language noun, and the frequency with which a second morpheme appears in the plurality of sentences, the second morpheme being a morpheme other than the foreign-language noun and the word immediately following it;
calculating, using the frequency with which the first morpheme appears, a first log likelihood representing the likelihood that a word having a co-occurrence relationship with the first morpheme belongs to the foreign language when such a word appears;
calculating, using the frequency with which the second morpheme appears, a second log likelihood representing the likelihood that a word having a co-occurrence relationship with the second morpheme belongs to the foreign language when such a word appears; and
creating a co-occurrence dictionary of the first morpheme and the second morpheme by associating the first morpheme with the first log likelihood and the second morpheme with the second log likelihood.

(Appendix 9) The method for creating a co-occurrence dictionary according to appendix 8, wherein, as the processing of calculating the frequency, the computer calculates, for those of the first morpheme and the second morpheme that are nouns, the frequency with which the morpheme appears in the plurality of sentences.

(Appendix 10) The method for creating a co-occurrence dictionary according to appendix 8 or appendix 9, wherein the computer executes weighting processing that, of the first log likelihood and the second log likelihood, gives the second log likelihood a larger weight than the first log likelihood.

(Appendix 11) A method for creating a single-kanji dictionary in which a computer executes processing comprising:
acquiring a plurality of foreign-language nouns composed of kanji, the foreign language being that of a country different from the country whose national language uses kanji-kana mixed text;
concatenating the acquired foreign-language nouns into a single kanji string;
calculating, for each single kanji contained in the concatenated kanji string, the frequency with which that single kanji appears in the kanji string;
calculating, using the frequency with which the single kanji appears, a log likelihood representing the likelihood that the single kanji belongs to the foreign language; and
creating a dictionary of the single kanji by associating each single kanji with its log likelihood.
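The steps of appendix 11 can be sketched in Python as follows. The appendix does not give the log-likelihood formula, so this sketch assumes the plain log of the relative frequency, log(count / total), as a placeholder; the function name and the sample nouns are also hypothetical. This placeholder yields negative values, whereas the example values quoted earlier for 釜 and 山 are positive, so the formula actually used evidently differs.

```python
import math
from collections import Counter


def build_single_kanji_dict(foreign_nouns):
    """Map each single kanji to an assumed log likelihood.

    foreign_nouns is a list of foreign-language nouns written in kanji.
    The log(count / total) formula is a placeholder assumption.
    """
    # Concatenate the acquired nouns into one kanji string.
    kanji_string = "".join(foreign_nouns)
    if not kanji_string:
        raise ValueError("at least one non-empty noun is required")

    # Count how often each single kanji appears in the string.
    counts = Counter(kanji_string)
    total = len(kanji_string)

    # Associate each single kanji with its (assumed) log likelihood.
    return {kanji: math.log(count / total) for kanji, count in counts.items()}


# Hypothetical sample input: three place names written in kanji.
sample_dict = build_single_kanji_dict(["釜山", "仁川", "南山"])
```

In the sample, 山 appears twice among six characters, so it receives a higher value than 釜, which appears once.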

Description of symbols
10 Speech synthesizer
11 Reception unit
12 Morpheme dictionary storage unit
13 Co-occurrence dictionary storage unit
14a Japanese dictionary storage unit
14b Foreign language dictionary storage unit
15a Corpus storage unit
15 Creation unit
16 Language processing unit
16a Analysis unit
16b Extraction unit
16c Calculation unit
16d Determination unit
16e Reading generation unit
16f Assigning unit
16g Phonetic generation unit
17 Prosody generation unit
18 Synthesis unit
19 Output unit

Claims

1. A language processing device comprising:
an analysis unit that performs morphological analysis on a sentence containing kanji using a predetermined dictionary, thereby dividing the sentence into morphemes and assigning a reading to each morpheme;
an extraction unit that extracts an unknown word written in kanji from the result of the morphological analysis by the analysis unit;
a calculation unit that calculates a foreign language score representing the likelihood that the unknown word written in kanji extracted by the extraction unit belongs to a foreign language different from the national language covered by the dictionary;
a determination unit that determines, based on the foreign language score calculated by the calculation unit, which country's word the unknown word written in kanji is; and
a reading generation unit that generates a reading of the unknown word written in kanji according to the determination result of the determination unit.

2. The language processing device according to claim 1, wherein the calculation unit calculates the foreign language score using a co-occurrence relationship between morphemes other than the unknown word contained in the sentence and words of the foreign language.

3. The language processing device according to claim 1, wherein the calculation unit calculates the foreign language score using the frequency, likelihood, or probability with which each single kanji making up the unknown word written in kanji appears in words of the foreign language.

4. The language processing device according to claim 1, 2, or 3, further comprising:
an assigning unit that assigns an accent to the reading of the unknown word written in kanji generated by the reading generation unit; and
a phonetic generation unit that generates a phonetic character string of the sentence from the result of the morphological analysis in which the assigning unit has assigned the accent to the reading of the unknown word written in kanji.

5. A speech synthesizer comprising:
an analysis unit that performs morphological analysis on a sentence containing kanji using a predetermined dictionary, thereby dividing the sentence into morphemes and assigning a reading to each morpheme;
an extraction unit that extracts an unknown word written in kanji from the result of the morphological analysis by the analysis unit;
a calculation unit that calculates a foreign language score representing the likelihood that the unknown word written in kanji extracted by the extraction unit belongs to a foreign language different from the national language covered by the dictionary;
a determination unit that determines, based on the foreign language score calculated by the calculation unit, which country's word the unknown word written in kanji is;
a reading generation unit that generates a reading of the unknown word written in kanji according to the determination result of the determination unit;
an assigning unit that assigns an accent to the reading of the unknown word written in kanji generated by the reading generation unit;
a phonetic generation unit that generates a phonetic character string of the sentence from the result of the morphological analysis in which the assigning unit has assigned the accent to the reading of the unknown word written in kanji;
a prosody generation unit that generates a prosody corresponding to the sentence based on the phonetic character string generated by the phonetic generation unit; and
a synthesis unit that generates a speech waveform from the prosody generated by the prosody generation unit and synthesizes speech.

6. A language processing method in which a computer executes processing comprising:
performing morphological analysis on a sentence containing kanji using a predetermined dictionary, thereby dividing the sentence into morphemes and assigning a reading to each morpheme;
extracting an unknown word written in kanji from the result of the morphological analysis;
calculating a foreign language score representing the likelihood that the unknown word written in kanji belongs to a foreign language different from the national language covered by the dictionary;
determining, based on the foreign language score, which country's word the unknown word written in kanji is; and
generating a reading of the unknown word written in kanji according to the determination result.

7. A language processing program that causes a computer to execute processing comprising:
performing morphological analysis on a sentence containing kanji using a predetermined dictionary, thereby dividing the sentence into morphemes and assigning a reading to each morpheme;
extracting an unknown word written in kanji from the result of the morphological analysis;
calculating a foreign language score representing the likelihood that the unknown word written in kanji belongs to a foreign language different from the national language covered by the dictionary;
determining, based on the foreign language score, which country's word the unknown word written in kanji is; and
generating a reading of the unknown word written in kanji according to the determination result.