JP2007171275A

JP2007171275A - Language processor and language processing method

Info

Publication number: JP2007171275A
Application number: JP2005365007A
Authority: JP
Inventors: Toshiaki Fukada; 俊明深田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2005-12-19
Filing date: 2005-12-19
Publication date: 2007-07-05

Abstract

PROBLEM TO BE SOLVED: To provide a highly portable language processing method capable of assigning readings correctly.

SOLUTION: A language processor comprises: a detecting means for detecting a character string which is not registered in a word dictionary, from objects to be processed including a plurality of character strings; an acquiring means for acquiring reading candidates for each character in the character string detected by the detecting means by using a single-kanji dictionary; a creating means for creating, from those reading candidates, reading candidates for the whole character string detected by the detecting means; and a selection means for selecting the reading of the character string from the reading candidates of the character string by using a pronunciation dictionary.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a language processing method for assigning readings to character strings.

In text-to-speech synthesis, which converts text into speech, and in the automatic creation of recognition grammars or pronunciation dictionaries for speech recognition, text must be assigned readings accurately. Conventionally, as shown in FIG. 3, a widely used method performs this reading assignment by morphological analysis using a word dictionary and a single-kanji dictionary. For example, when the text 「今日は良い天気です」 ("The weather is nice today") is input, the morphological analysis unit 301 refers to the word dictionary 302 and obtains the result 「今日（キョウ）／は（ワ）／良い（ヨイ）／天気（テンキ）／で（デ）／す（ス）」. Here the parentheses enclose readings and ／ marks a word boundary. The technique for obtaining this result is well known, so its details are omitted. In this example every word in the input text exists in the word dictionary, so the correct readings are obtained. However, consider the example 「私の名前は山本紗耶香です」 ("My name is Yamamoto Sayaka"), where 紗耶香 is correctly read 「サヤカ」, and suppose that the name 「紗耶香」 is not in the word dictionary, that is, it is judged to be an unregistered word. In this case 「紗耶香」 is assigned a reading using the single-kanji dictionary 303. Suppose the single-kanji dictionary contains the readings 「紗（シャ）」, 「耶（ヤ）」, and 「香（コウ）」. The result is then 「私（ワタシ）／の（ノ）／名前（ナマエ）／は（ワ）／山本（ヤマモト）／紗（シャ）／耶（ヤ）／香（コウ）／で（デ）／す（ス）」, and 「紗耶香」 is incorrectly read as 「シャヤコウ」.
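The conventional fallback described above can be reproduced with a minimal sketch. It assumes toy dictionaries holding only the entries quoted in the text, and it takes the word segmentation as given rather than computing it. The names `WORD_DICT`, `KANJI_DICT`, and `naive_reading` are hypothetical and do not come from the patent.

```python
# Toy dictionaries containing only the entries quoted in the text above.
WORD_DICT = {"私": "ワタシ", "の": "ノ", "名前": "ナマエ", "は": "ワ",
             "山本": "ヤマモト", "で": "デ", "す": "ス"}
KANJI_DICT = {"紗": "シャ", "耶": "ヤ", "香": "コウ"}  # one reading per character

def naive_reading(words):
    """Read registered words from WORD_DICT; spell out the rest character by character."""
    readings = []
    for word in words:
        if word in WORD_DICT:
            readings.append(WORD_DICT[word])
        else:
            # Conventional fallback: concatenate the single-kanji readings.
            readings.append("".join(KANJI_DICT[ch] for ch in word))
    return readings

segmented = ["私", "の", "名前", "は", "山本", "紗耶香", "で", "す"]
print(naive_reading(segmented)[5])  # シャヤコウ (the intended reading is サヤカ)
```

Because the fallback has only one reading per character and no way to rank alternatives, it cannot recover 「サヤカ」.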

To address this problem, Patent Document 1 detects an unregistered word, obtains candidate combinations of readings using a single-kanji dictionary, selects one of the combinations by referring to rules, and uses the selected reading if it is found in the word dictionary. An example of such a rule is: "a given character is read with its kun reading more often than with its on reading, and when it is read with its kun reading, the following character is often read with its kun reading as well."

Patent Document 2 discloses an analysis method that uses information on the reading tendencies of words. This information, which is contained in the word dictionary, describes the influence of factors such as the number of characters, the number of morae, grammar, accent type, preceding and following characters, on or kun reading, and voicing. For example, when 「黒森」 is detected as an unregistered word, the reading tendency of the word 「森」 (a two-kanji noun ending in 「森」 has accent type 2, and its first character is usually read with its kun reading) is used to determine the reading of 「黒」 and the accent type of the whole word.

Japanese Patent Laid-Open No. 8-185197
Japanese Patent Laid-Open No. 11-249866

Patent Document 1 determines the reading of an unregistered word by referring to a word dictionary. A word dictionary usually stores tens of thousands to hundreds of thousands of words, covering the various parts of speech and fields needed for analysis. On the other hand, most words judged to be unregistered are proper nouns, personal names in particular. If a word dictionary containing many words other than proper nouns is used, a candidate reading may therefore wrongly match the reading of a common noun or a function word. For example, 「結衣」 may match 「決意」 and be read 「ケツイ」 (correct: 「ユイ」), and 「七海」 may match 「七味」 and be read 「シチミ」 (correct: 「ナナミ」), so that readings suitable for personal names are not assigned. In other words, to identify the correct reading among the candidates obtained from a single-kanji dictionary, it is necessary to use, instead of a word dictionary containing all kinds of words, a pronunciation dictionary that takes into account the category of the words that become unregistered words.

Furthermore, Patent Documents 1 and 2 determine readings by rule-based heuristic methods. Consequently, when the domain of the text to be analyzed changes, or when conventions such as the writing of personal names go in and out of fashion, creating and updating the rules takes considerable effort. The present invention has been made in view of the above problems, and its object is to provide a language processing method that is highly portable and assigns readings accurately.

To achieve the above object, a language processing apparatus of the present invention comprises: detecting means for detecting, from a processing target including a plurality of character strings, a character string not registered in a word dictionary; acquiring means for acquiring, using a single-kanji dictionary, reading candidates for each character in the character string detected by the detecting means; generating means for generating, from those reading candidates, reading candidates for the entire character string detected by the detecting means; and selecting means for selecting, using a pronunciation dictionary, the reading of the character string from the character-string reading candidates.

Also to achieve the above object, a language processing apparatus of the present invention comprises: detecting means for detecting, from a processing target including a plurality of character strings, a character string not registered in a word dictionary; acquiring means for acquiring attribute information indicating an attribute of the character string detected by the detecting means; generating means for generating reading candidates for each character in the detected character string using at least one single-kanji dictionary corresponding to the attribute information; character-string reading candidate generating means for generating, from those reading candidates, reading candidates for the entire character string; and selecting means for selecting the reading of the character string from the character-string reading candidates using at least two pronunciation dictionaries corresponding to the attribute information.

Also to achieve the above object, a language processing method of the present invention comprises: a detecting step of detecting, from a processing target including a plurality of character strings, a character string not registered in a word dictionary; an acquiring step of acquiring, using a single-kanji dictionary, reading candidates for each character in the character string detected in the detecting step; a generating step of generating, from those reading candidates, reading candidates for the entire character string detected in the detecting step; and a selecting step of selecting, using a pronunciation dictionary, the reading of the character string from the character-string reading candidates.

Also to achieve the above object, a language processing method of the present invention comprises: a detecting step of detecting, from a processing target including a plurality of character strings, a character string not registered in a word dictionary; an acquiring step of acquiring attribute information indicating an attribute of the character string detected in the detecting step; a generating step of generating reading candidates for each character in the detected character string using at least one single-kanji dictionary corresponding to the attribute information; a character-string reading candidate generating step of generating, from those reading candidates, reading candidates for the entire character string; and a selecting step of selecting the reading of the character string from the character-string reading candidates using at least two pronunciation dictionaries corresponding to the attribute information.

According to the present invention, readings can be assigned more accurately to the character strings to be processed.

Preferred embodiments of the present invention are described below with reference to the drawings.

FIG. 2 is a block diagram showing the configuration of a language processing apparatus according to Embodiment 1 of the present invention. Reference numeral 201 denotes a CPU, which performs various kinds of control in the language processing apparatus of this embodiment according to a control program stored in the ROM 202 or a control program loaded from the external storage device 204 into the RAM 203. The ROM 202 stores various parameters, the control program executed by the CPU 201, and the like. The RAM 203 provides a work area when the CPU 201 executes various kinds of control, and stores the control program executed by the CPU 201. Reference numeral 204 denotes an external storage device such as a hard disk, floppy (registered trademark) disk, CD-ROM, DVD-ROM, or memory card. When the external storage device is a hard disk, it stores various programs installed from a CD-ROM, floppy (registered trademark) disk, or the like. Reference numeral 205 denotes an input device, such as a numeric keypad, buttons, touch panel, keyboard, mouse, microphone, or pen, for entering or selecting text information by external operation. The input device may be attached directly to the language processing apparatus, or the apparatus may be operated from outside using a remote control, computer, mobile phone, or the like via communication such as infrared, wireless LAN, the Internet, or a telephone line (devices for the communication part are omitted). A combination of these forms is also possible. Reference numeral 206 denotes an output device such as a CRT, liquid crystal display, or speaker. Reference numeral 207 denotes a bus connecting the above units. The text to be processed need not be entered or selected with the input device 205; it may be held in 202, 203, or 204, or obtained via communication such as infrared, wireless LAN, the Internet, or a telephone line.

FIG. 1 is a block diagram showing the module configuration of the language processing method. Reference numeral 101 denotes a character string detection unit, which analyzes the text to be analyzed using the word dictionary 105 and detects character strings not contained in 105. Reference numeral 102 denotes a character reading candidate generation unit, which uses the single-kanji dictionary 106 to generate reading candidates for each character of a character string detected by 101. Reference numeral 103 denotes a character-string reading candidate generation unit, which generates reading candidates for the whole character string from the character reading candidates generated by 102. Reference numeral 104 denotes a character-string reading selection unit, which uses the pronunciation dictionary 107 to select the reading of the character string from the candidates generated by 103, yielding the analysis result. For character strings that 101 can analyze, the processing of 102 to 104 is skipped and the analysis result is obtained directly from 101.

Next, the processing flow of this embodiment is described, taking as an example the case where the text 「私の名前は山本紗耶香です」 is analyzed and assigned readings.

FIG. 4 is a flowchart for analyzing this text. First, in step S401, the text to be analyzed is acquired. Next, in step S402, preprocessing is performed before text analysis, such as sentence splitting and deletion of unnecessary characters such as quotation marks. In step S403, the preprocessed text is divided into words (morphemes) using the word dictionary 105 (morphological analysis).

FIG. 6 shows an example of the word dictionary. The first to fifth columns are the word ID, surface form, reading, part of speech, and score, respectively. The score is the logarithm of the probability of occurrence of each word, obtained for example from word unigrams over some language corpus. Any morphological analysis method may be used; for example, a word lattice can be generated from the entries found in the word dictionary, and the analysis result obtained by applying a criterion such as the longest match method or the minimum cost method (with the scores of FIG. 6, the maximum score method).

FIG. 7 shows an example of the word lattice obtained when the text 「私の名前は山本紗耶香です」 is morphologically analyzed using the word dictionary of FIG. 6. Here, 「名前」 is found either as 「名前」 (word ID = 3) or as 「名」 (word ID = 4) followed by 「前」 (word ID = 5), and 「山本」 is found either as 「山本」 (word ID = 7) or as 「山」 (word ID = 8) followed by 「本」 (word ID = 9). The portion 「沙耶香」 does not exist in the word dictionary and is therefore an unregistered word. Applying the longest match method or the minimum cost method to this word lattice yields the path shown by the solid line in FIG. 7, that is, the word segmentation 「私／の／名前／は／山本／沙耶香／で／す」. Because the word dictionary of FIG. 6 contains reading and part-of-speech information in addition to the surface form, the analysis result for every word other than 「沙耶香」 includes its reading and part of speech as well as the segmentation.
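The maximum score search over a word lattice can be sketched as a small dynamic program. This is only an illustration under stated assumptions: the scores in `WORD_SCORES` are invented (FIG. 6 is not reproduced here), the per-character penalty for characters outside the dictionary and the merging of adjacent unknown characters into one unregistered word are choices made for this sketch, and the names `segment` and `WORD_SCORES` are hypothetical.

```python
# Hypothetical log-probability scores; only the surface forms come from the text.
WORD_SCORES = {"私": -5, "の": -2, "名前": -6, "名": -7, "前": -7, "は": -2,
               "山本": -8, "山": -6, "本": -6, "で": -3, "す": -4}

def segment(text, word_scores, unknown_penalty=-100.0):
    """Return [(surface, is_registered)] along the maximum-score lattice path."""
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best[i]: best score of text[:i]
    back = [None] * (n + 1)            # back[i]: (start, is_registered) of last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in word_scores:
                score, known = word_scores[piece], True
            elif end - start == 1:     # assumed fallback: one unknown character
                score, known = unknown_penalty, False
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = (start, known)
    pieces, end = [], n
    while end > 0:                     # trace the best path backwards
        start, known = back[end]
        pieces.append((text[start:end], known))
        end = start
    pieces.reverse()
    merged = []                        # join adjacent unknown characters
    for surface, known in pieces:
        if not known and merged and not merged[-1][1]:
            merged[-1] = (merged[-1][0] + surface, False)
        else:
            merged.append((surface, known))
    return merged

result = segment("私の名前は山本沙耶香です", WORD_SCORES)
print("／".join(surface for surface, _ in result))
```

With these assumed scores, 「名前」 and 「山本」 win over their single-character splits because one log-probability is larger than the sum of two, and 「沙耶香」 comes out flagged as unregistered.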

Next, in step S404, the counter i is set to 1 and I is set to the number of words; in this example there are eight words including 「沙耶香」, so I = 8. In step S405, it is determined whether the analysis result output of step S410 has been performed for all words. If i > I, the processing ends; otherwise the flow proceeds to step S406. In step S406, it is determined whether the i-th word w(i) is an unregistered word. If it is not, the flow proceeds to step S410 and the analysis result (surface form, reading, part of speech, and so on) is output. If it is an unregistered word (in the example above, 「沙耶香」, i.e. w(6)), the flow proceeds to step S407. In step S407, the single-kanji dictionary 106 is searched to generate reading candidates for each character of the unregistered word. In step S408, the character reading candidates generated in S407 are used to generate reading candidates for the whole character string of the unregistered word. In step S409, the pronunciation dictionary 107 is searched to identify and select the reading of the character string from the candidates generated in S408. In step S410, the analysis result is output; for 「沙耶香」 the output is, for example, 「沙耶香／サヤカ／未登録語」 (surface form / reading / "unregistered word"). In step S411, the counter i is incremented by one and the flow returns to S405.
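The loop of steps S404 to S411 might be sketched as follows. The function name `assign_readings`, the part-of-speech labels in `WORD_DICT`, and the stand-in resolver passed as `read_unregistered` are assumptions for illustration; the readings are those quoted in the text, and the real S407 to S409 processing is replaced here by a fixed lookup.

```python
# Readings are from the text; the part-of-speech labels are hypothetical.
WORD_DICT = {"私": ("ワタシ", "pronoun"), "の": ("ノ", "particle"),
             "名前": ("ナマエ", "noun"), "は": ("ワ", "particle"),
             "山本": ("ヤマモト", "proper noun"), "で": ("デ", "auxiliary"),
             "す": ("ス", "auxiliary")}

def assign_readings(words, word_dict, read_unregistered):
    """Emit 'surface／reading／part of speech' for every word, as in S404 to S411."""
    results = []
    i, total = 1, len(words)                 # S404: i = 1, I = number of words
    while i <= total:                        # S405: finish when i > I
        surface = words[i - 1]               # w(i)
        if surface in word_dict:             # S406: registered word
            reading, pos = word_dict[surface]
        else:                                # S407 to S409, delegated to a callback
            reading, pos = read_unregistered(surface), "未登録語"
        results.append(f"{surface}／{reading}／{pos}")  # S410: output
        i += 1                               # S411
    return results

words = ["私", "の", "名前", "は", "山本", "沙耶香", "で", "す"]
# Stand-in for S407 to S409: a fixed answer for the one unregistered word.
output = assign_readings(words, WORD_DICT, lambda s: {"沙耶香": "サヤカ"}[s])
print(output[5])  # 沙耶香／サヤカ／未登録語
```

Passing the unregistered-word reader as a callback keeps the loop independent of how S407 to S409 are implemented.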

FIG. 5 is a flowchart of the processing in S407 to S409 that assigns a reading to an unregistered word. The processing that obtains the reading 「サヤカ」 from the unregistered word 「沙耶香」 is described in detail below. In step S501, the counter j is set to 1 and J is set to the number of characters in w(i); for w(6) = 「沙耶香」 there are three characters, so J = 3. In step S502, it is determined whether reading candidates have been generated for all characters. If j > J, the flow proceeds to step S505; otherwise it proceeds to step S503. In S503, the reading candidates r(j) for the j-th character c(j) are generated by searching the single-kanji dictionary 106. FIG. 8 shows an example of the single-kanji dictionary. The first to third columns are the single-kanji ID, surface form, and reading, respectively; when a character has several readings, they are listed separated by 「／」. From FIG. 8, the reading candidates r(1) for c(1), that is, 「沙」, are 「シャ」 and 「サ」. In step S504, the counter j is incremented by one and the flow returns to S502. Repeating the same processing gives 「ヤ」, 「ジャ」, and 「シャ」 as the reading candidates for 「耶」, and 「コウ」, 「キョウ」, 「カ」, and 「カオ」 for 「香」.

In step S505, the reading candidates for w(i), that is, the character-string reading candidates t(k), are generated from these character reading candidates. In this example, representing the character reading candidates as a pronunciation lattice gives FIG. 10(a); expanding every path of the lattice gives the 24 character-string reading candidates shown in FIG. 10(b), which are taken as t(k). In step S506, the counter k is set to 1, K is set to the number of reading candidates for w(6), that is, K = 24, and the maximum pronunciation score maxScore and its index maxID are initialized, for example to maxScore = -1000 and maxID = 1. In step S507, it is determined whether all character-string reading candidates have been looked up. If k > K, the flow proceeds to step S512; otherwise it proceeds to step S508. In step S508, the score of t(k) is looked up in the pronunciation dictionary 107.
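Steps S501 to S505 amount to taking the Cartesian product of the per-character reading lists. The sketch below uses the readings quoted in the text; the names `KANJI_READINGS`, `char_reading_candidates`, and `string_reading_candidates` are hypothetical, and the enumeration order of `itertools.product` is not claimed to match the candidate IDs of FIG. 10(b).

```python
from itertools import product

# Per-character readings as quoted in the text (FIG. 8 itself is not reproduced).
KANJI_READINGS = {
    "沙": ["シャ", "サ"],
    "耶": ["ヤ", "ジャ", "シャ"],
    "香": ["コウ", "キョウ", "カ", "カオ"],
}

def char_reading_candidates(word, kanji_dict):
    """S501 to S504: collect the reading list r(j) for each character c(j)."""
    return [kanji_dict[ch] for ch in word]

def string_reading_candidates(word, kanji_dict):
    """S505: expand every path through the pronunciation lattice."""
    per_char = char_reading_candidates(word, kanji_dict)
    return ["".join(path) for path in product(*per_char)]

candidates = string_reading_candidates("沙耶香", KANJI_READINGS)
print(len(candidates))  # 24, i.e. 2 x 3 x 4 paths
```

A character missing from the dictionary raises `KeyError` in this sketch; how the apparatus handles that case is not stated in the text.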

FIG. 9 shows an example of the pronunciation dictionary. The first to third columns are the pronunciation ID, reading, and score, respectively; a larger score indicates that the reading is more likely to occur. When t(k) is looked up in the order of the candidate IDs in FIG. 10(b), the reading t(1) is not in the pronunciation dictionary. If the score assigned when the reading t(k) is absent from the pronunciation dictionary is -1000, the current score becomes curScore = -1000, so the determination in step S509 leads to step S511, where the counter k is incremented by one, and the flow returns to S507. The readings t(2) to t(4) are not in the pronunciation dictionary either, so the same processing is repeated. Next, t(5) = 「サヤコウ」 exists in the pronunciation dictionary, so its score is set as curScore = -30. This satisfies the condition of S509, so maxScore = -30 and maxID = 5 are set. The next reading, t(6), is not in the pronunciation dictionary. The reading t(7) = 「サヤカ」 is in the dictionary with curScore = -7, so S510 sets maxScore = -7 and maxID = 7. None of t(8) to t(24) exists in the pronunciation dictionary, so in S512 the reading of w(6) = 「沙耶香」 is identified as t(7) = 「サヤカ」 and the processing ends.
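The selection loop of S506 to S512 can be sketched as below. The two dictionary scores and the missing-entry score of -1000 are the ones given in the text; the function name `select_reading` is hypothetical, the comparison in S509 is assumed to be a strict "greater than", and the candidate list is a four-item excerpt rather than the full 24 candidates of FIG. 10(b).

```python
MISSING_SCORE = -1000                              # score for readings not in the dictionary
PRONUNCIATION_DICT = {"サヤコウ": -30, "サヤカ": -7}  # the two entries quoted in the text

def select_reading(candidates, pron_dict):
    """Return (reading, score) of the best candidate, following S506 to S512."""
    max_score, max_id = MISSING_SCORE, 1                   # S506: initialization
    for k, reading in enumerate(candidates, start=1):      # S507 / S511: loop over t(k)
        cur_score = pron_dict.get(reading, MISSING_SCORE)  # S508: dictionary lookup
        if cur_score > max_score:                          # S509: assumed strict comparison
            max_score, max_id = cur_score, k               # S510: remember the best so far
    return candidates[max_id - 1], max_score               # S512: identified reading

excerpt = ["シャヤコウ", "サヤコウ", "サヤカ", "サヤカオ"]
print(select_reading(excerpt, PRONUNCIATION_DICT))  # ('サヤカ', -7)
```

Because maxID starts at 1, a word with no candidate in the dictionary falls back to the first candidate in this sketch.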

In this embodiment, S512 outputs the candidate with the highest score, but the present invention is not limited to this; several candidates may be output in descending order of score. This embodiment also described an example using one single-kanji dictionary and one pronunciation dictionary, but several of each may be used. For example, for reading personal names, two sets may be prepared, one pairing a single-kanji dictionary and a pronunciation dictionary for male names and the other pairing them for female names, and the reading identified using the reading candidates from each set. Similarly, besides gender, several single-kanji dictionaries and pronunciation dictionaries may be prepared according to differences in generation, region, and so on, and used for the processing. Furthermore, although the text analyzed in this embodiment was the sentence 「私の名前は山本紗耶香です」, the same processing can be applied to a single word such as 「紗耶香」 or to a single phrase.
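Returning several candidates in descending order of score, as mentioned above, could look like the following sketch. The function name `top_readings`, the parameter `n`, and the candidate excerpt are assumptions; the scores are the two values quoted earlier for 「サヤコウ」 and 「サヤカ」.

```python
import heapq

PRONUNCIATION_DICT = {"サヤコウ": -30, "サヤカ": -7}  # scores quoted earlier in the text

def top_readings(candidates, pron_dict, n=2, missing=-1000):
    """Return the n best (reading, score) pairs, highest score first."""
    scored = [(pron_dict.get(reading, missing), reading) for reading in candidates]
    best = heapq.nlargest(n, scored, key=lambda pair: pair[0])
    return [(reading, score) for score, reading in best]

excerpt = ["シャヤコウ", "サヤコウ", "サヤカ", "サヤカオ"]
print(top_readings(excerpt, PRONUNCIATION_DICT))  # [('サヤカ', -7), ('サヤコウ', -30)]
```

A downstream component, for example a speech recognition grammar builder, could then keep more than one pronunciation per unregistered word.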

The reading results obtained in this embodiment can be used to assign readings to unregistered words in text-to-speech synthesis. They can also be used to create pronunciation dictionaries or speech recognition grammars for unregistered words in speech recognition.

As is clear from the above description, according to this embodiment, reading candidates for an unregistered word are generated using a single-kanji dictionary, and the reading is identified from these candidates using a pronunciation dictionary that is separate from the word dictionary. Readings can therefore be assigned accurately and with high portability.

In addition, the quality of text-to-speech synthesis and the accuracy of automatically created pronunciation dictionaries for speech recognition are improved. Furthermore, changes in the domain of the text to be analyzed and changes in writing conventions can be handled easily.

The single-kanji dictionary used in the embodiment above contains the single-kanji ID, surface form, and reading, as shown in FIG. 8. The present invention is not limited to this, however, and can also be applied when the single-kanji dictionary contains score information.

FIG. 11 is an example of a single-kanji dictionary with score information. The first to fourth columns are the single-kanji ID, surface form, reading, and score, respectively; when a character has several readings, they are listed separated by 「／」. A larger score indicates that the reading is more likely to occur. The reading assignment for an unregistered word when the single-kanji dictionary contains score information is basically the same as in the preceding embodiment, so only the differences from FIG. 5 are described.

First, in S503, the reading candidates r(j) for c(j) are generated using a single-kanji dictionary 106 with score information, such as the one shown in FIG. 11. FIG. 12 shows an example of the character-string reading candidates obtained from the single-kanji dictionary of FIG. 11, that is, an example of generating the reading candidates t(k) for w(i) in S505. Each score in FIG. 12 is the sum of the scores in FIG. 11 for the readings of the individual kanji. For example, the score of candidate ID = 1, 「シャヤコウ」, is -4 - 5 - 3 = -12, because the scores of 「シャ」, 「ヤ」, and 「コウ」 in FIG. 11 are -4, -5, and -3, respectively. In S508, curScore is set to the sum of the score in the pronunciation dictionary 107 and the score obtained from the single-kanji dictionary. Here, the pronunciation dictionary score used when the reading t(k) is absent from the pronunciation dictionary is set to, for example, -500. As a result, when none of the reading candidates exists in the pronunciation dictionary, the reading of w(i) is identified on the basis of the single-kanji dictionary scores.
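The combined scoring can be sketched as follows. The scores for 「シャ」, 「ヤ」, and 「コウ」, the pronunciation dictionary scores, and the missing-entry score of -500 are taken from the text; the scores for 「サ」 and 「カ」 are invented because FIG. 11 is not reproduced, only a subset of the readings is included, and the names `KANJI_SCORES`, `scored_candidates`, and `best_reading` are hypothetical.

```python
from itertools import product

KANJI_SCORES = {
    "沙": {"シャ": -4, "サ": -7},   # the score for サ is hypothetical
    "耶": {"ヤ": -5},
    "香": {"コウ": -3, "カ": -4},   # the score for カ is hypothetical
}
PRON_SCORES = {"サヤコウ": -30, "サヤカ": -7}
MISSING_PRON_SCORE = -500           # used when a reading is not in the dictionary

def scored_candidates(word, kanji_scores):
    """S503/S505: yield (reading, sum of per-character scores) for every path."""
    per_char = [kanji_scores[ch].items() for ch in word]
    for path in product(*per_char):
        yield "".join(r for r, _ in path), sum(s for _, s in path)

def best_reading(word, kanji_scores, pron_scores):
    """S508 to S512: maximize kanji score plus pronunciation dictionary score."""
    def cur_score(item):
        reading, kanji_score = item
        return kanji_score + pron_scores.get(reading, MISSING_PRON_SCORE)
    best = max(scored_candidates(word, kanji_scores), key=cur_score)
    return best[0], cur_score(best)

print(dict(scored_candidates("沙耶香", KANJI_SCORES))["シャヤコウ"])  # -12
print(best_reading("沙耶香", KANJI_SCORES, PRON_SCORES))
print(best_reading("沙耶香", KANJI_SCORES, {}))  # falls back to the kanji scores
```

With an empty pronunciation dictionary every candidate receives the same -500 offset, so the ranking is decided by the single-kanji scores alone, which matches the fallback behavior described above.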

In this embodiment, the sum of the score obtained from the single-kanji dictionary and the score of the pronunciation dictionary is used, but the present invention is not limited to this; the score may be calculated by any operation, such as a weighted sum or a product. For instance, the pronunciation dictionary score may be left out entirely and only the score obtained from the single-kanji dictionary used. In that case, the scores shown in FIG. 9 need not be stored in the pronunciation dictionary.
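One way to express the weighted-sum variant is a small helper with two weights. The function name `combined_score` and the parameters `w_kanji` and `w_pron` are assumptions; the text only states that a weighted sum, a product, or any other operation may be used. The example values -12 and -500 are the single-kanji score of 「シャヤコウ」 and the missing-entry score from the preceding paragraph.

```python
def combined_score(kanji_score, pron_score, w_kanji=1.0, w_pron=1.0):
    """Weighted sum of the single-kanji score and the pronunciation dictionary score."""
    return w_kanji * kanji_score + w_pron * pron_score

# Plain sum, as in the embodiment: -12 + (-500).
print(combined_score(-12, -500))               # -512.0
# Setting w_pron to 0 ignores the pronunciation dictionary score entirely.
print(combined_score(-12, -500, w_pron=0.0))   # -12.0
```

Setting `w_pron` to zero corresponds to the case in which the pronunciation dictionary does not need to store scores.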

In the embodiment described above, the text to be analyzed is analyzed using the word dictionary 105, and readings are assigned to the character strings determined to be unregistered words. If an attribute of an unregistered character string can be acquired, for example, that it relates to a person name, a place name, or an organization such as a company name, readings can be assigned more accurately by using a pronunciation dictionary corresponding to that attribute. FIG. 13 is a block diagram showing the module configuration of a language processing method that acquires the attribute of an unregistered character string and uses it to assign a reading.

Reference numeral 1301 denotes a character string detection unit, which analyzes the text to be analyzed using the word dictionary 1306 and detects character strings not included in 1306. Reference numeral 1302 denotes a character string attribute acquisition unit, which acquires an attribute for each character string not included in 1306. Reference numeral 1303 denotes a character reading candidate generation unit, which generates reading candidates for each character of the character string detected in 1301 using the single kanji dictionary 1306. Reference numeral 1304 denotes a character string reading candidate generation unit, which generates reading candidates for the entire character string from the per-character candidates generated in 1303. Reference numeral 1305 denotes a character string reading selection unit, which identifies and selects the reading of the character string from the reading candidates generated in 1304 and the attribute information acquired in 1302, using a plurality of pronunciation dictionaries (in this example, the two pronunciation dictionaries 1308 and 1309), and thereby obtains the analysis result. For character strings that can be analyzed in 1301, the processing of 1303 to 1305 is not performed, and the analysis result is obtained after the character string attribute is acquired in 1302.

The processing flow of this embodiment is almost the same as that of FIGS. 4 and 5 described in the preceding embodiment, so only the differences are described. As in the preceding embodiment, the description takes as an example the case of analyzing and assigning readings to the text “私の名前は山本紗耶香です” (“My name is Yamamoto Sayaka”). The pronunciation dictionary 1 (1308) and the pronunciation dictionary 2 (1309) are assumed to relate to person names and place names, respectively. Using a word dictionary 1306 similar to that of FIG. 6, the text is segmented into the words “私／の／名前／は／山本／沙耶香／で／す”, and the character string “沙耶香” (Sayaka) is detected as an unregistered word. Next, in 1302, an attribute of the character string “沙耶香” is acquired. In this example, an attribute such as “person name” or “given name of a full name” is acquired for the character string. This attribute can be acquired in various ways. For example, by obtaining from 1306 the information that the word “山本” (Yamamoto) is a person name or the family name of a full name, it can be determined that “沙耶香” relates to a person name or the given name of a full name. Alternatively, knowledge such as “an unregistered word between the word ‘名前は’ (‘name is’) and the word ‘です’ is a person name” can be used to determine that “沙耶香” is a person name. Instead of estimating the attribute from the text analysis result, it is also possible to acquire the attribute of the unregistered word from an attribute of the application, for example, that the text was entered in a field for person names, or to have the user specify the attribute.
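The three ways of acquiring the attribute mentioned above could be combined roughly as in the following Python sketch. Everything in it is an assumption for illustration: the function name, the attribute labels `"family_name"` and `"person_name"`, the `hint` parameter standing in for an application-supplied or user-specified attribute, and the order in which the rules are tried.

```python
def guess_attribute(tokens, idx, word_attrs, hint=None):
    """Return an attribute label for the unregistered token tokens[idx], or None.

    tokens     -- word segmentation result (list of strings)
    word_attrs -- hypothetical attribute info from the word dictionary
    hint       -- attribute given by the application or the user, if any
    """
    # 1. An attribute supplied by the application or user takes priority.
    if hint is not None:
        return hint
    # 2. The preceding word is known to be a family name.
    if idx > 0 and word_attrs.get(tokens[idx - 1]) == "family_name":
        return "person_name"
    # 3. The unregistered word sits directly between 名前は and です.
    before = "".join(tokens[:idx])
    after = "".join(tokens[idx + 1:])
    if before.endswith("名前は") and after.startswith("です"):
        return "person_name"
    return None


tokens = ["私", "の", "名前", "は", "山本", "沙耶香", "で", "す"]
attribute = guess_attribute(tokens, 5, {"山本": "family_name"})
```

Here the second rule fires because “山本” is marked as a family name in the assumed word dictionary information, so `attribute` is `"person_name"`.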

The processing of 1303 and 1304 is the same as that of 102 and 103, respectively, so its description is omitted. Next, when the reading of the character string is identified in 1305, the pronunciation dictionary for the attribute acquired in 1302 is used. In this example, since the attribute of the character string “沙耶香” is a person name, the reading is assigned using the pronunciation dictionary 1 (1308), which relates to person names.
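A minimal Python sketch of choosing the pronunciation dictionary by attribute is given below. The dictionary contents, their scores, the attribute labels, and the function name are hypothetical; the fallback score −500 follows the value mentioned for readings missing from the pronunciation dictionary, and a higher score is assumed to be better.

```python
MISSING_SCORE = -500  # score for readings absent from the chosen dictionary

# One pronunciation dictionary per attribute (reading -> score).
# The entries and scores are made up for illustration.
PRON_DICTS = {
    "person_name": {"サヤカ": -2},  # stands in for pronunciation dictionary 1 (1308)
    "place_name": {"サヤマ": -3},   # stands in for pronunciation dictionary 2 (1309)
}


def select_by_attribute(candidates, attribute, pron_dicts=PRON_DICTS):
    """Pick the best reading using the dictionary that matches the attribute.

    candidates -- dict mapping each string reading to its single kanji score
    """
    pron = pron_dicts.get(attribute, {})
    return max(
        candidates,
        key=lambda r: candidates[r] + pron.get(r, MISSING_SCORE),
    )


candidates = {"シャヤコウ": -12, "サヤカ": -18}
chosen = select_by_attribute(candidates, "person_name")
```

With the person-name dictionary, “サヤカ” is chosen. With the place-name dictionary, neither candidate is listed, so the single kanji scores decide and “シャヤコウ” is chosen.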

In this embodiment, the pronunciation dictionaries relate to person names and place names, but the present invention is not limited to this; any attribute that can be acquired in 1302 may be used, such as family name and given name, male and female, region, or generation. Although two pronunciation dictionaries are used here, more may be provided according to the types of attributes. Although one single kanji dictionary is used here, a plurality of single kanji dictionaries corresponding to the types of pronunciation dictionaries may be used. Furthermore, although the attribute is determined uniquely in 1302, the attribute may instead be determined probabilistically (for example, 0.9 for person name and 0.1 for place name), and the reading of the character string may be identified in 1305 by taking these values into account as scores or weights.
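The text leaves open how the attribute probabilities enter the score. One possible interpretation, sketched below in Python, weights the pronunciation dictionary score of each attribute by its probability and adds the result to the single kanji score. The function names, dictionary contents, and scores are assumptions for illustration; only the probabilities 0.9 and 0.1 are taken from the example in the text.

```python
def expected_pron_score(reading, attr_probs, pron_dicts, missing=-500):
    """Probability-weighted pronunciation score over all attributes."""
    return sum(
        p * pron_dicts.get(attr, {}).get(reading, missing)
        for attr, p in attr_probs.items()
    )


def select_weighted(candidates, attr_probs, pron_dicts):
    """Pick the reading with the best single kanji score plus weighted score."""
    return max(
        candidates,
        key=lambda r: candidates[r]
        + expected_pron_score(r, attr_probs, pron_dicts),
    )


# Hypothetical dictionaries and candidate scores.
pron_dicts = {"person_name": {"サヤカ": -2}, "place_name": {"サヤマ": -3}}
candidates = {"シャヤコウ": -12, "サヤカ": -18}
attr_probs = {"person_name": 0.9, "place_name": 0.1}
chosen = select_weighted(candidates, attr_probs, pron_dicts)
```

A uniquely determined attribute is the special case in which one attribute has probability 1.0.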

The present invention can also be achieved by supplying a program that implements the functions of the above-described embodiments to a system or apparatus, directly or remotely, and having a computer included in the system or apparatus read and execute the supplied program code.

Accordingly, the program code itself, installed in a computer in order to implement the functions and processing of the present invention on that computer, also implements the present invention. In other words, the computer program for implementing the above functions and processing is itself one aspect of the present invention.

In that case, the program may take any form, such as object code, a program executed by an interpreter, or script data supplied to an OS, as long as it has the functions of the program.

Recording media for supplying the program include, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, an MO, a CD-ROM, a CD-R, and a CD-RW. Other examples of recording media include a magnetic tape, a non-volatile memory card, a ROM, and a DVD (DVD-ROM, DVD-R).

The program may also be downloaded from a web page on the Internet using a browser on a client computer. That is, the computer program of the present invention itself, or a compressed file including an automatic installation function, may be downloaded from the web page to a recording medium such as a hard disk. The program code constituting the program of the present invention may also be divided into a plurality of files, with each file downloaded from a different web page. In other words, a WWW server that allows a plurality of users to download the program files for implementing the functions and processing of the present invention on a computer may also be a constituent element of the present invention.

The program of the present invention may also be encrypted, stored in a storage medium such as a CD-ROM, and distributed to users. In this case, only users who satisfy predetermined conditions are allowed to download key information for decryption from a web page via the Internet; the encrypted program may then be decrypted with the key information, executed, and installed on the computer.

The functions of the above-described embodiments may also be realized by the computer executing the program it has read. An OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program. In this case, too, the functions of the above-described embodiments can be realized.

Furthermore, the program read from the recording medium may be written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer. Based on the instructions of the program, a CPU or the like provided on the function expansion board or in the function expansion unit may perform part or all of the actual processing. The functions of the above-described embodiments may also be realized in this way.

FIG. 1 is a block diagram showing the module configuration of a language processing method according to an embodiment.
FIG. 2 is a block diagram showing the hardware configuration of a language processing apparatus equipped with the language processing method according to the embodiment.
FIG. 3 is a block diagram showing the module configuration of a conventional language processing method.
FIG. 4 is a flowchart of the language processing method according to the embodiment.
FIG. 5 is a flowchart of the processing for assigning readings to unregistered words.
FIG. 6 is an example of a word dictionary.
FIG. 7 is an example of a text analysis result represented as a word lattice.
FIG. 8 is an example of a single kanji dictionary.
FIG. 9 is an example of a pronunciation dictionary.
FIG. 10 is an example of reading candidates for an unregistered word represented as (a) a pronunciation lattice and (b) character string reading candidates.
FIG. 11 is an example of a single kanji dictionary with score information.
FIG. 12 is an example of character string reading candidates obtained from the single kanji dictionary with score information.
FIG. 13 is a block diagram showing the module configuration of a language processing method having a character string attribute acquisition unit according to an embodiment.

Claims

1. A language processing apparatus comprising:
detecting means for detecting a character string not registered in a word dictionary from a processing target including a plurality of character strings;
obtaining means for obtaining reading candidates for each character in the character string detected by the detecting means, using a single kanji dictionary;
generating means for generating reading candidates for the entire character string detected by the detecting means from the reading candidates for each character; and
selection means for selecting a reading of the character string from the character string reading candidates, using a pronunciation dictionary.

2. The language processing apparatus according to claim 1, wherein the single kanji dictionary holds a reading score for each character.

3. The language processing apparatus according to claim 1, wherein the pronunciation dictionary holds a reading score for a character string.

4. The language processing apparatus according to claim 1, wherein the selection means identifies a reading of the character string from the character string reading candidates and selects the identified reading.

5. The language processing apparatus according to claim 1, wherein the selection means selects the reading of the character string using at least one of a reading score for each character and a reading score for the character string.

6. The language processing apparatus according to claim 5, further comprising multiple-candidate output means for outputting readings having high scores as a plurality of candidates.

7. The language processing apparatus according to claim 1, wherein the reading of the character string selected by the selection means is used for assigning readings in text-to-speech synthesis, or for creating a pronunciation dictionary or a speech recognition grammar in speech recognition.

8. A language processing apparatus comprising:
detecting means for detecting a character string not registered in a word dictionary from a processing target including a plurality of character strings;
obtaining means for obtaining attribute information indicating an attribute of the character string detected by the detecting means;
generating means for generating reading candidates for each character in the character string detected by the detecting means, using at least one single kanji dictionary corresponding to the attribute information;
character string reading candidate generation means for generating reading candidates for the entire character string from the reading candidates for each character; and
selection means for selecting a reading of the character string from the character string reading candidates, using at least two pronunciation dictionaries corresponding to the attribute information.

9. The language processing apparatus according to claim 8, wherein the attribute of the character string relates to at least one of a person name, a place name, an organization name, gender, generation, and region.

10. A language processing method comprising:
a detection step of detecting a character string not registered in a word dictionary from a processing target including a plurality of character strings;
an acquisition step of acquiring reading candidates for each character in the character string detected in the detection step, using a single kanji dictionary;
a generation step of generating reading candidates for the entire character string detected in the detection step from the reading candidates for each character; and
a selection step of selecting a reading of the character string from the character string reading candidates, using a pronunciation dictionary.

11. The language processing method according to claim 10, wherein the single kanji dictionary holds a reading score for each character.

12. The language processing method according to claim 10, wherein the pronunciation dictionary holds a reading score for a character string.

13. The language processing method according to claim 10, wherein the selection step identifies a reading of the character string from the character string reading candidates and selects the identified reading.

14. The language processing method according to claim 10, wherein the selection step selects the reading of the character string using at least one of a reading score for each character and a reading score for the character string.

15. The language processing method according to claim 14, further comprising a multiple-candidate output step of outputting readings having high scores as a plurality of candidates.

16. The language processing method according to claim 10, wherein the reading of the character string selected in the selection step is used for assigning readings in text-to-speech synthesis, or for creating a pronunciation dictionary or a speech recognition grammar in speech recognition.

17. A language processing method comprising:
a detection step of detecting a character string not registered in a word dictionary from a processing target including a plurality of character strings;
an acquisition step of acquiring attribute information indicating an attribute of the character string detected in the detection step;
a generation step of generating reading candidates for each character in the character string detected in the detection step, using at least one single kanji dictionary corresponding to the attribute information;
a character string reading candidate generation step of generating reading candidates for the entire character string from the reading candidates for each character; and
a selection step of selecting a reading of the character string from the character string reading candidates, using at least two pronunciation dictionaries corresponding to the attribute information.

18. The language processing method according to claim 17, wherein the attribute of the character string relates to at least one of a person name, a place name, an organization name, gender, generation, and region.

19. A control program for causing a computer to execute the language processing method according to claim 10.