JP2022141520A

JP2022141520A - Voice synthesis symbol editing device, method and program

Info

Publication number: JP2022141520A
Application number: JP2021041871A
Authority: JP
Inventors: 信行西澤; Nobuyuki Nishizawa
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2021-03-15
Filing date: 2021-03-15
Publication date: 2022-09-29

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis symbol editing device that can be used with reduced impact from the configuration of a TTS system and from modifications to it. SOLUTION: A speech synthesis symbol editing device 10 accepts as input a pair of a text and a speech synthesis symbol string corresponding to the text, and outputs a speech synthesis symbol string. The device includes: a correspondence estimation unit 11 that divides the input text into predetermined character units and the input speech synthesis symbols into predetermined speech synthesis symbol tokens, and assigns, based on predetermined criteria, a correspondence between each divided character and zero or more speech synthesis symbol tokens; and an editing interface unit 123 that displays each character and the zero or more speech synthesis symbol tokens given that correspondence, while updating the speech synthesis symbol tokens by accepting user editing operations. SELECTED DRAWING: Figure 1

Description

TECHNICAL FIELD The present invention relates to a speech synthesis symbol editing device, method, and program that take a text and a speech synthesis symbol string as input and allow a user to operate on the speech synthesis symbol string.

Speech synthesis technology is a technique for artificially synthesizing speech. A typical application is text-to-speech conversion (hereinafter TTS). In Japanese, for example, the text given to TTS is usually a sentence mixing kanji and kana, and directly mapping characters to the features of the speech to be synthesized is difficult because the structure of that relationship is extremely complex. An abstracted intermediate representation is therefore used. The conversion proceeds in two stages, from text to the intermediate representation and from the intermediate representation to speech features, and a synthesized speech waveform is then obtained either by generating, through signal processing, a waveform that matches the speech feature information, or by selecting suitable pieces from a store of waveforms prepared in advance.

The intermediate representation is, for example, a command symbol string for speech synthesis (hereinafter a speech synthesis symbol string). One example of a speech synthesis symbol string for Japanese speech synthesis is given in Non-Patent Document 1 (JEITA IT-4006, "Symbols for Japanese text-to-speech synthesis"). IT-4006 defines two notations, "kana-level notation" and "allophone-level notation". Both consist of a combination of a phoneme sequence and descriptions relating to prosody control. (Here "phoneme" is not used in its phonetic sense but in a more abstract sense, meaning the kinds of sounds that make up speech. The same applies hereinafter.)

The physical features of speech, on the other hand, are for example time-series data of parameters representing the acoustic features of the speech. Specifically, they are expressed as, for example, cepstral coefficients and fundamental frequency values at 1 ms intervals.

Most reading errors in a text-to-speech system arise in the conversion from text (a sentence mixing kanji and kana) to a speech synthesis symbol string (hereinafter text analysis). The creator of the synthesized speech can therefore produce synthesized speech free of reading errors by further correcting the speech synthesis symbol string that was automatically converted from the text.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2013-134396, "Synthesized speech correction device, method, and program"

Non-Patent Document 1: JEITA standard IT-4006, "Symbols for Japanese text-to-speech synthesis"

Most speech synthesis symbol string definitions are designed on the basis of phonetic knowledge. Even a format that is relatively easy to write, such as the JEITA IT-4006 notation mentioned above, requires prior knowledge of phoneme types, Tokyo-dialect accent, and the way intonation is described. In the kana-level notation of JEITA IT-4006, the phoneme symbols (defined in the standard as "reading symbols") are katakana (except that ヂ, ヅ, and ヲ are written ジ, ズ, and オ respectively) and are easy to understand. By contrast, the accent nucleus symbol 「'」 and the symbols for the accent phrase boundary, phrase boundary, and pause, 「／」, 「｜」, and 「＿」 (hereinafter prosodic boundary symbols), are unfamiliar to many users down to the underlying concepts, and inserting prosodic boundary symbols appropriately in particular requires a certain amount of knowledge and experience.

To improve convenience for general users, editing devices for speech synthesis systems such as the one shown in Patent Document 1 have been devised. However, the method shown there is equivalent to inserting a user correction step into part of the text analysis process based on morphological analysis, which is also used in ordinary TTS systems. The editing system for speech synthesis symbol strings therefore contains the text analysis process of the TTS system.

When speech synthesis symbols are to be edited by a user with no knowledge or experience of speech synthesis symbol strings, it is desirable for the system to present the correspondence between each character, or each morpheme, of the text and the symbol elements that make up the speech synthesis symbols (hereinafter speech synthesis symbol tokens).

In addition, because a user without knowledge or experience cannot easily read a speech synthesis symbol string in the first place, such a user is likely to judge whether the symbol string needs editing mainly by listening to the synthesized speech. If the reading produced by the target TTS system can be reproduced exactly on the editing system, the editing device can conveniently double as a system for checking synthesized speech. For example, consider an information distribution system using speech synthesis in which, for data compression or similar reasons, a speech synthesis symbol string is sent as additional information only for sentences that automatic processing would misread. Whether the additional information needs to be sent can then also be judged on the editing device. For this reason, it is desirable that the editing system also present the automatic processing result of the target TTS system and be able to reproduce its synthesized speech.

However, some TTS system configurations convert text to a speech synthesis symbol string with a conversion process, such as an end-to-end estimation method using deep learning, from which the correspondence between the characters of the text and the individual symbols of the output speech synthesis symbol string cannot be obtained directly. When the target TTS system uses such a conversion method, a conventional speech synthesis symbol string editing device for that system has had to choose one of two options: (1) give up presenting the relationship between the characters or morphemes of the text and the corresponding speech synthesis symbol tokens, or (2) use in the editing system a text analysis process that differs from the one used in the TTS system and that does yield the correspondence.

Furthermore, because text analysis errors are relatively frequent, as noted above, the internal processing of text analysis in a real TTS system is often changed. For the convenience of the person doing the editing, it is desirable that the text analysis result of the real TTS system can always be reproduced exactly on the editing device. Achieving this has meant that the editing device also had to be modified every time the TTS system was modified.

SUMMARY OF THE INVENTION In view of the above problems of the prior art, an object of the present invention is to provide a speech synthesis symbol editing device, method, and program that can be used with reduced impact from the configuration of a TTS system and from modifications to it.

To achieve the above object, the present invention is a speech synthesis symbol editing device that takes as input a pair of a text and a speech synthesis symbol string corresponding to the text and outputs a speech synthesis symbol string, the device comprising: a correspondence estimation unit that divides the input text into predetermined character units and the input speech synthesis symbols into predetermined speech synthesis symbol tokens, and assigns, based on predetermined criteria, a correspondence between each divided character and zero or more speech synthesis symbol tokens; and an editing interface unit that displays each character and the zero or more speech synthesis symbol tokens given that correspondence, while updating the speech synthesis symbol tokens by accepting user editing operations. The invention is also characterized as a method and a program corresponding to the speech synthesis symbol editing device.

According to the present invention, a correspondence is assigned between each character of the text and zero or more speech synthesis symbol tokens, and that correspondence is displayed to the user so that the user's editing operations can be accepted. This makes it possible to use the speech synthesis symbol editing device with reduced impact from the configuration of the TTS system and from modifications to it.

FIG. 1 is a functional block diagram of a speech synthesis symbol editing device according to one embodiment.
FIG. 2 is a functional block diagram of a synthesized speech editing device according to one embodiment.
FIG. 3 is a diagram showing an example of processing in the correspondence estimation unit.
FIG. 4 is a diagram showing the hardware configuration of a general computer.

FIG. 1 is a functional block diagram of a speech synthesis symbol editing device according to one embodiment. The speech synthesis symbol editing device 10 includes a correspondence estimation unit 11, an editing result display unit 12, a speech synthesis symbol editing unit 13, a speech synthesis symbol correction unit 14, an edit record database 15, and an edit reproduction unit 16. FIG. 2 is a functional block diagram of a synthesized speech editing device 20 according to one embodiment. The synthesized speech editing device 20 includes a text analysis unit 21, the speech synthesis symbol editing device 10 whose configuration is shown in FIG. 1, and a speech synthesis unit 22.

The speech synthesis symbol editing device 10 provides an interface that accepts user operations for editing, in order to correct it, the speech synthesis symbol string obtained by analyzing Japanese text (a sentence mixing kanji and kana) in the text analysis unit 21, which, as shown in FIG. 2, may be configured as any TTS system. The symbol string is output together with the original Japanese text so that the user can refer to both. The speech synthesis symbol string edited by the user is further output to the speech synthesis unit 22 and played back as synthesized speech, so the user can also listen to the speech synthesized from the edited symbol string and check the result. The user may edit the symbol string on the speech synthesis symbol editing device 10 iteratively while listening to and checking this synthesized speech. (A skilled user, on the other hand, may be able to grasp how the synthesized speech will sound from the symbol string alone and skip the step of listening to the output of the speech synthesis unit 22.)

An outline of the processing performed by each unit of the speech synthesis symbol editing device 10 is given first.

The speech synthesis symbol editing device 10 accepts editing work from the user through the following flow (1) to (4) and outputs the speech synthesis symbol string as edited by the user.

(1) The speech synthesis symbol editing device 10 receives as input, at the correspondence estimation unit 11, a text (a sentence mixing kanji and kana) and a speech synthesis symbol string that is the text analysis result for that text (the result obtained by analyzing the text in the text analysis unit 21 of FIG. 2).

(2) The correspondence estimation unit 11 divides the speech synthesis symbol string into a sequence of speech synthesis symbol tokens, links zero or more tokens to each character of the text, and outputs the result as correspondence information to the editing result display unit 12, as indicated by line L3. (Alternatively, as a variation described later, the output of the correspondence estimation unit 11 may be passed to the editing result display unit 12 via the edit reproduction unit 16, as indicated by line L4.)

(3) The editing result display unit 12 presents to the user the correspondence estimated by the correspondence estimation unit 11. In hardware, the editing result display unit 12 is realized as a display, and it presents the result by displaying each speech synthesis symbol token and each text character together with the correspondence between them. The editing result display unit 12 also concatenates the speech synthesis symbol tokens to generate a speech synthesis symbol string (as corrected by user editing), based on instruction input from the user who has viewed and checked the displayed estimation result (that is, input, accepted by the speech synthesis symbol editing unit 13 described next, for correcting those tokens that need correction).

While the user is still editing, the speech synthesis symbol string displayed and generated (held) by the editing result display unit 12 is held in a form shared with the speech synthesis symbol editing unit 13, as indicated by the double-headed arrow L0. When the user judges that editing is finished (or wants an intermediate result to check) and performs an input to confirm the edit, the confirmed speech synthesis symbol string is output from the speech synthesis symbol editing device 10, as indicated by line L1. (Alternatively, as a variation described later, instead of being output directly along line L1, the string may be output after automatic correction by the speech synthesis symbol correction unit 14 has been applied, as indicated by line L2.)

In other words, the editing result display unit 12 serves as a display interface that shows the user the correspondence between each character of the text being edited and zero or more speech synthesis symbol tokens. When the user corrects a token through the speech synthesis symbol editing unit 13 described next, the correction is reflected immediately and shown to the user via the editing result display unit 12. The editing result display unit 12 and the speech synthesis symbol editing unit 13 thus together form an editing interface unit 123, which provides the user with an editing buffer for editing speech synthesis symbols and realizes a display that reflects each successive edit. Such an editing buffer can be provided per sentence, or per any unit the user wishes to edit.

(4) The speech synthesis symbol editing unit 13, following operations by the user who has visually checked the correspondence, replaces or deletes the speech synthesis symbol tokens linked to a given character, or adds tokens to be linked to that character, and sends the result back to the editing result display unit 12. (As indicated by the double-headed arrow L0, the result of each successive edit is shared between the editing result display unit 12 and the speech synthesis symbol editing unit 13.) In this way, corrections to the correspondence made by user operation are accepted one after another and displayed to the user.

(5) The speech synthesis symbol correction unit 14 is used when the speech synthesis symbol string is output along line L2 instead of directly along line L1. It takes the speech synthesis symbol string created by concatenating the tokens held in the editing result display unit 12, rewrites it according to predetermined rules so that it satisfies the required specification for speech synthesis symbol strings, and outputs it.

The flow (1) to (5) above assumes that the user edits everything each time. To reduce the user's editing workload, past editing results may instead be accumulated and applied automatically in advance, by additionally performing the flow (6) to (7) below. As an additional configuration for this purpose, the speech synthesis symbol editing device 10 may include an edit record database 15 and an edit reproduction unit 16. (When this additional configuration is applied, the output of the correspondence estimation unit 11 is processed along line L4. When it is not applied, the output is processed along line L3.)

(6) The edit record database 15 stores each editing operation (replacement, deletion, or addition) that the user performs in the speech synthesis symbol editing unit 13 on the speech synthesis symbol tokens linked to a character of the text, in the form of a pair of the character and the edited speech synthesis symbol tokens. One or more characters before and after that character in the text, and the speech synthesis symbol tokens before editing, may be stored together with the pair.

(7) The edit reproduction unit 16 takes the output of the correspondence estimation unit 11, namely the sequence of characters making up the text and the speech synthesis symbol tokens linked to each character, and, referring to the edit record database 15, performs editing operations on the tokens linked to each character according to predetermined criteria. The user may be allowed to choose whether the edit reproduction unit 16 is enabled.
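Steps (6) and (7) can be illustrated with a minimal Python sketch. The source leaves the "predetermined criteria" open, so the matching rule used here (same character, same neighbouring characters, same tokens before editing) is an assumption, and the class and method names are hypothetical.

```python
class EditRecordDB:
    """Stores edits keyed by a character, its neighbours, and its tokens before editing."""

    def __init__(self):
        self._records = {}

    @staticmethod
    def _key(pairs, pos):
        # Context: one character before and after (empty string at the ends).
        prev_char = pairs[pos - 1][0] if pos > 0 else ""
        next_char = pairs[pos + 1][0] if pos + 1 < len(pairs) else ""
        char, before = pairs[pos]
        return (prev_char, char, next_char, tuple(before))

    def record(self, pairs, pos, edited_tokens):
        """Step (6): remember that the tokens at `pos` were edited to `edited_tokens`."""
        self._records[self._key(pairs, pos)] = list(edited_tokens)

    def replay(self, pairs):
        """Step (7): apply stored edits to a fresh correspondence, leaving the rest as is."""
        out = []
        for pos, (char, toks) in enumerate(pairs):
            edited = self._records.get(self._key(pairs, pos))
            out.append((char, list(edited) if edited is not None else list(toks)))
        return out


db = EditRecordDB()
analysed = [("良", ["イ'"]), ("い", ["ー"])]
db.record(analysed, 0, ["ヨ'"])  # the user changed the reading of 良
db.record(analysed, 1, ["イ"])   # and of い
print(db.replay([("良", ["イ'"]), ("い", ["ー"])]))
```

Keying on the surrounding characters and the pre-edit tokens keeps a stored edit from being replayed in an unrelated context, at the price of fewer automatic matches.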

Each unit of the speech synthesis symbol editing device 10 is described in more detail below.

<Correspondence estimation unit 11>
Processing in the correspondence estimation unit 11 proceeds, for example, as follows. As illustrated in FIG. 3, for the text 「今日は良い天気ですね。」 ("The weather is nice today, isn't it?"), the speech synthesis symbols in JEITA IT-4006 kana-level notation are, for example, 「キョ'ーワ｜イ'ー／テ'ンキデスネ．」. The input to the speech synthesis symbol editing device 10 is the pair of these two strings, and that pair is passed unchanged to the correspondence estimation unit 11.

The correspondence estimation unit divides the text into characters and the speech synthesis symbols into tokens. Hiragana and katakana, however, are divided into morae, and the character string making up one mora is treated as one character. For example, 「きゃ」 is treated as one character.

In the example of FIG. 3, the text and the speech synthesis symbols are divided into the following units, which the figure shows enclosed in boxes (□):
「今」「日」「は」「良」「い」「天」「気」「で」「す」「ね」「。」
「キョ'」「ー」「ワ」「｜」「イ'」「ー」「／」「テ'」「ン」「キ」「デ」「ス」「ネ」「．」
(For 「イ'」「ー」, the figure shows the result after the processing described later has also been applied.)

The accent nucleus symbol 「'」 of IT-4006 is not treated as an independent token here. It is combined with the preceding reading symbol and the two are treated as a single token.
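The splitting rules just described can be sketched in Python as follows. This is a simplified illustration, not the device's actual implementation: the function names are hypothetical, and treating every small kana as part of the preceding character is an assumption that covers the 「きゃ」 case but ignores edge cases such as a small kana directly after a kanji.

```python
# Assumption: a small kana attaches to the previous character to form one mora.
SMALL_KANA = set("ぁぃぅぇぉゃゅょゎァィゥェォャュョヮ")
ACCENT_NUCLEUS = "'"


def split_text(text):
    """Split text into character units; a kana plus small kana counts as one unit."""
    units = []
    for ch in text:
        if ch in SMALL_KANA and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def split_symbols(symbols):
    """Split a kana-level symbol string into tokens.

    The accent nucleus mark is merged into the preceding reading symbol.
    """
    tokens = []
    for ch in symbols:
        if (ch in SMALL_KANA or ch == ACCENT_NUCLEUS) and tokens:
            tokens[-1] += ch
        else:
            tokens.append(ch)
    return tokens


print(split_text("今日は良い天気ですね。"))
print(split_symbols("キョ'ーワ｜イ'ー／テ'ンキデスネ．"))
```

On the example sentence this yields the same eleven text units and fourteen tokens as the division listed above.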

Zero or more speech synthesis symbol tokens are then linked to each character of the text. The correspondence between kana characters in the text and the reading symbols among the tokens is evident. By additionally applying empirical rules (a rule base), namely that a kanji normally has some reading and that prosodic boundary symbols may be linked to punctuation marks or brackets, the following correspondence is obtained, as also shown in FIG. 3:
「今」＝「キョ'」
「日」＝「ー」
「は」＝「ワ」
「」＝「｜」
「良」＝「ヨ'」
「い」＝「イ」
「」＝「／」
「天」＝「テ'」「ン」
「気」＝「キ」
「で」＝「デ」
「す」＝「ス」
「ね」＝「ネ」
「。」＝「．」
(In FIG. 3, each text character unit and the corresponding unit of the speech synthesis symbol string are enclosed in boxes (□) and arranged one above the other. The editing result display unit 12 can, for example, display the correspondence graphically to the user in this form.)

In this example, no text character corresponds to the prosodic boundary symbols 「｜」 and 「／」 in the middle of the speech synthesis symbol string, so an empty character 「」 is inserted into the text characters to describe the correspondence.

This correspondence can be found automatically by, for example, defining costs such as those listed below and applying a minimum-cost method. Such processing is easy to implement with existing optimization techniques such as dynamic programming.

A hiragana or katakana character matches the katakana of the reading symbol: cost 0
Hiragana 「は」 with the reading 「ワ」: cost 1
Hiragana 「へ」 with the reading 「エ」: cost 1
Hiragana 「あ」 to 「お」 with the reading 「ー」 (long vowel mark): cost 1
The full stop 「。」 of the text corresponds to the terminal symbol 「．」: cost 1
A hiragana or katakana character satisfies none of the four kana rules above: cost 100
One kanji corresponds to any one-mora reading symbol other than 「ン」: cost 20
One kanji corresponds to the reading symbol 「ン」: cost 70
One kanji corresponds to reading symbols of N morae (N ≥ 2) that do not begin with 「ン」: cost 10 × (N − 1)
One kanji corresponds to reading symbols of N morae (N ≥ 2) that begin with 「ン」: cost 50 + 10 × (N − 1)
One kanji has no corresponding reading symbol: cost 100 (that is, a kanji is allowed to have no reading at all)
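A minimum-cost alignment over the cost list above can be sketched with dynamic programming as follows. This is an illustrative sketch under stated assumptions, not the patented implementation. The list does not say what it costs to pair a prosodic boundary symbol with an inserted empty character, so `BOUNDARY_COST = 0` is an assumption, as is the restriction that a kana unit takes exactly one token. All function names are hypothetical.

```python
import math

BOUNDARIES = set("｜／＿")  # phrase boundary, accent phrase boundary, pause
TERMINAL = "．"
BOUNDARY_COST = 0           # assumption: not specified in the cost list
INF = math.inf


def to_katakana(s):
    return "".join(chr(ord(c) + 0x60) if "ぁ" <= c <= "ゖ" else c for c in s)


def is_kana(unit):
    return all("ぁ" <= c <= "ゖ" or "ァ" <= c <= "ヺ" or c == "ー" for c in unit)


def is_kanji(unit):
    return len(unit) == 1 and "\u4e00" <= unit <= "\u9fff"


def unit_cost(unit, toks):
    """Cost of linking one text unit to a run of tokens (per the cost list)."""
    if any(t in BOUNDARIES or t == TERMINAL for t in toks):
        # Only the full stop may take the terminal symbol.
        return 1 if unit == "。" and toks == [TERMINAL] else INF
    n = len(toks)
    if is_kana(unit):
        if n != 1:
            return INF  # assumption: a kana unit takes exactly one token
        reading = toks[0].rstrip("'")
        if to_katakana(unit) == reading:
            return 0
        if (unit, reading) in {("は", "ワ"), ("へ", "エ")}:
            return 1
        if unit in "あいうえお" and reading == "ー":
            return 1
        return 100
    if is_kanji(unit):
        if n == 0:
            return 100
        starts_with_n = toks[0].rstrip("'") == "ン"
        if n == 1:
            return 70 if starts_with_n else 20
        return (50 if starts_with_n else 0) + 10 * (n - 1)
    return INF  # other characters are not handled in this sketch


def align(units, tokens):
    """Return (total cost, [(text unit, tokens)]); "" is an inserted empty character."""
    n, m = len(units), len(tokens)
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0

    def relax(i, j, cost, prev):
        if cost < best[i][j]:
            best[i][j], back[i][j] = cost, prev

    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            if j < m and tokens[j] in BOUNDARIES:  # boundary with no text character
                relax(i, j + 1, best[i][j] + BOUNDARY_COST, (i, j))
            if i < n:  # link units[i] to tokens[j:k], possibly zero tokens
                for k in range(j, m + 1):
                    relax(i + 1, k, best[i][j] + unit_cost(units[i], tokens[j:k]), (i, j))
    if best[n][m] == INF:
        raise ValueError("no alignment found")
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):  # walk the back-pointers from the end
        pi, pj = back[i][j]
        pairs.append(("" if pi == i else units[pi], tokens[pj:j]))
        i, j = pi, pj
    return best[n][m], pairs[::-1]


units = list("今日は良い天気ですね。")
tokens = ["キョ'", "ー", "ワ", "｜", "イ'", "ー", "／", "テ'", "ン", "キ", "デ", "ス", "ネ", "．"]
cost, pairs = align(units, tokens)
for unit, toks in pairs:
    print(f"「{unit}」＝" + "".join(f"「{t}」" for t in toks))
```

With the input symbol string of the running example, the sketch links 「天」 to 「テ'」「ン」 and 「気」 to 「キ」, and pairs each of the two boundary symbols with an inserted empty character.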

For kanji readings, the correspondence between characters and speech synthesis symbol tokens can be found more accurately by preparing a single-kanji dictionary in advance (a dictionary that gives the readings of each individual kanji) and defining the costs so that the cost is small when the reading matches the dictionary. Under the cost definition above, for example, reading symbol tokens end up being assigned evenly across a run of consecutive kanji, so errors tend to occur where the text contains a long kanji string. Using a single-kanji dictionary to handle the correspondence between kanji and reading symbols more precisely reduces such errors.
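A dictionary-aware cost of this kind might look like the following Python sketch. The dictionary excerpt, the function name, and the value `DICTIONARY_MATCH_COST = 5` are all assumptions made for illustration. The source only requires that a dictionary match be cheap.

```python
# Hypothetical excerpt of a single-kanji dictionary (kanji -> known readings).
SINGLE_KANJI_READINGS = {
    "天": {"テン", "アマ", "アメ"},
    "気": {"キ", "ケ"},
}
DICTIONARY_MATCH_COST = 5  # assumption: any value below the generic kanji costs


def kanji_cost_with_dictionary(kanji, toks, generic_cost):
    """Return a small cost when the tokens spell a reading listed for the kanji.

    `generic_cost` is the cost the dictionary-free rules would assign.
    """
    # Accent nucleus marks are not part of the reading, so strip them first.
    reading = "".join(t.rstrip("'") for t in toks)
    if reading in SINGLE_KANJI_READINGS.get(kanji, ()):
        return DICTIONARY_MATCH_COST
    return generic_cost


print(kanji_cost_with_dictionary("天", ["テ'", "ン"], 10))
print(kanji_cost_with_dictionary("天", ["テ'"], 20))
```

Falling back to the generic cost keeps kanji that are missing from the dictionary alignable.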

<Speech synthesis symbol editing unit 13>
The speech synthesis symbol editing unit 13 corrects, deletes, and adds the tokens corresponding to a character in accordance with user input. When a prosodic symbol is inserted and no corresponding character, including an empty character 「」, exists for it, an empty character 「」 may be inserted at the corresponding position automatically by rule, without any user operation. Likewise, when deleting a token leaves an empty character 「」 with no corresponding speech synthesis symbol token, that empty character may be removed from the character sequence automatically by rule, without any user operation.
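The automatic handling of empty characters can be sketched in Python as follows. The buffer layout (a list of `[character, tokens]` entries) and both function names are assumptions made for illustration.

```python
def insert_boundary(buffer, pos, symbol):
    """Insert a prosodic boundary symbol in front of entry `pos`.

    If an empty-character entry already sits there, the symbol is linked to it.
    Otherwise a new empty-character entry is created automatically.
    """
    if pos > 0 and buffer[pos - 1][0] == "":
        buffer[pos - 1][1].append(symbol)
    else:
        buffer.insert(pos, ["", [symbol]])


def delete_token(buffer, pos, index):
    """Delete one token; drop the entry if it was an empty character left with no tokens."""
    del buffer[pos][1][index]
    char, toks = buffer[pos]
    if char == "" and not toks:
        del buffer[pos]


buffer = [["良", ["ヨ'"]], ["い", ["イ"]], ["", ["／"]], ["天", ["テ'", "ン"]]]
delete_token(buffer, 2, 0)        # removing 「／」 also removes its empty character
insert_boundary(buffer, 2, "／")  # reinserting it recreates the empty character
print(buffer)
```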

<Speech synthesis symbol correction unit 14>
The speech synthesis symbol correction unit 14 is a mechanism that prevents the speech synthesis symbol editing device 10 from outputting a formally invalid speech synthesis symbol string. (A formally invalid speech synthesis symbol string cannot be synthesized by the speech synthesis unit 22 of FIG. 2.)

The editing interface unit 123 holds the speech synthesis symbol token sequence (as an editing buffer), linked to the characters and updated as each user edit is applied. Of that sequence, the subsequence that the user has marked as confirmed through input to the editing result display unit 12 (which may be the whole editing buffer) is output to the speech synthesis symbol correction unit 14. By having only this subsequence pass through the speech synthesis symbol correction unit 14 and then be output as synthesized speech by, for example, the speech synthesis unit 22 of FIG. 2, the user can listen to and check the synthesized speech for that subsequence. This confirmation input can be made not only when the user judges that editing is completely finished, but also as an intermediate confirmation when the user wants to check by listening to the synthesized speech (so as to continue editing if the synthesized speech sounds unnatural).

For example, if the user deletes the "／" in "キョ'ーワ｜ヨ'イ／テ'ンキデスネ．" in the speech synthesis symbol editing unit 13, simply concatenating the speech synthesis symbol tokens yields "キョ'ーワ｜ヨ'イテ'ンキデスネ．". This form violates the JEITA IT-4006 rule that an accent phrase (a section delimited by any of the accent phrase boundary "／", the phrase boundary "｜", or the pause "＿") must contain at most one accent nucleus "'".

In response, the speech synthesis symbol correction unit 14 rewrites the speech synthesis symbol string into a formally valid form based on a predetermined rule. For example, if the rule is to keep the first accent nucleus in an accent phrase and delete the other accent nucleus symbols, the speech synthesis symbol correction unit 14 outputs the valid speech synthesis symbol string "キョ'ーワ｜ヨ'イテンキデスネ．". (That is, it deletes the accent nucleus of "テ'" so that the accent phrase contains at most one accent nucleus "'".) As a result, the output of the speech synthesis symbol editing device 10 can be fed directly to a speech synthesizer that takes a speech synthesis symbol string as input (the speech synthesis unit 22 in FIG. 2), so that, for example, the user can check the content displayed by the editing result display unit 12 as synthesized speech.
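The example rule above (keep the first accent nucleus in each accent phrase, delete the rest) can be sketched at the character level as follows. The function name `keep_first_accent` and the choice to operate on a plain string are assumptions made for illustration; the description only states the rule itself.

```python
# Delimiters that end an accent phrase, as listed in the description:
# accent phrase boundary, phrase boundary, pause.
BOUNDARIES = {"／", "｜", "＿"}
ACCENT = "'"  # accent nucleus mark


def keep_first_accent(symbols: str) -> str:
    """Keep only the first accent nucleus mark within each accent phrase."""
    out = []
    seen_accent = False
    for ch in symbols:
        if ch in BOUNDARIES:
            seen_accent = False      # a new accent phrase starts after a boundary
            out.append(ch)
        elif ch == ACCENT:
            if not seen_accent:      # first nucleus in this phrase: keep it
                out.append(ch)
                seen_accent = True
            # later nuclei in the same phrase are dropped
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(keep_first_accent("キョ'ーワ｜ヨ'イテ'ンキデスネ．"))
```

Other correction rules (for example, keeping the last nucleus instead) would fit the same structure; the description leaves the rule as a configurable choice.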

Here, because the processing of the speech synthesis symbol correction unit 14 is performed immediately before output, as indicated by line L2 (as an alternative to line L1), the original information outside the edited portion is preserved in the system (as data shared by the editing result display unit 12 and the speech synthesis symbol editing unit 13, indicated by line L0). For example, if "／" is later reinserted at its original position in the speech synthesis symbol editing unit 13, the original speech synthesis symbols are restored, with the reading symbols of "天" being "テ'" and "ン" (accent nucleus on テ).
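The benefit of correcting only at output time can be illustrated with a small sketch. The `EditBuffer` class, its method names, and the flat token list are hypothetical; the point shown is only that the buffer is never mutated by the correction, so reinserting the boundary restores the original accent.

```python
BOUNDARIES = {"／", "｜", "＿"}


def correct(tokens):
    """Return a corrected copy: drop accent marks after the first in each accent phrase."""
    out, seen_accent = [], False
    for tok in tokens:
        if tok in BOUNDARIES:
            seen_accent = False
        elif "'" in tok:
            if seen_accent:
                tok = tok.replace("'", "")
            seen_accent = True
        out.append(tok)
    return out


class EditBuffer:
    """Hypothetical edit buffer; correction is applied only when output is requested."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def delete(self, index):
        del self.tokens[index]

    def insert(self, index, token):
        self.tokens.insert(index, token)

    def output(self):
        # The stored tokens are left untouched; only the returned string is corrected.
        return "".join(correct(self.tokens))


if __name__ == "__main__":
    buf = EditBuffer(["キョ'", "ー", "ワ", "｜", "ヨ'", "イ", "／",
                      "テ'", "ン", "キ", "デ", "ス", "ネ", "．"])
    buf.delete(6)          # user deletes the accent phrase boundary
    print(buf.output())    # corrected form is output
    buf.insert(6, "／")    # user reinserts the boundary
    print(buf.output())    # the accent on テ reappears
```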

<Speech synthesis symbol editing unit 13, editing record database 15, editing reproduction unit 16>
The speech synthesis symbol editing unit 13 may record editing operations in the editing record database 15 in a predetermined format. For example, when the reading symbol of "良" is rewritten from "ヨ'" to "イ'", the editing operation is recorded in the editing record database 15 in a format expressing, for example, that the reading symbol "ヨ'" of the character "良" is rewritten to "イ'". Other formats may also be used: one that specifies more detailed conditions on the characters, such as rewriting the reading symbol "ヨ'" of the character "良" to "イ'" when the preceding character is "" and the following character is "い", or a more simplified one, such as rewriting the reading symbol of the character "良" to "ヨ'".

When the editing record database 15 is provided, an editing reproduction unit 16 can also be provided. For the output of the correspondence estimation unit 11 for a new input to the speech synthesis symbol editing device 10, the editing reproduction unit 16 queries the editing record database 15 with reference to the characters and speech synthesis symbol tokens. If editing information matching the query exists in the editing record database 15, the editing reproduction unit 16 applies that editing operation to its input and outputs the result to the editing result display unit 12.

That is, for the sequence of characters and speech synthesis symbol tokens with assigned correspondences obtained by the correspondence estimation unit 11, each rewriting rule in the editing information recorded in the editing record database 15 is checked for applicability. Any applicable rule is applied, and the result is output to the editing result display unit 12.

If the earlier example is in the editing record database, then for the input "今日は良い日和ですね。" and "キョ'ーワ｜ヨ'イ／ヒヨリデ'スネ．", a version in which the reading of "良" is rewritten to "イ'" can be output to the speech synthesis symbol editing unit 13, saving the user the effort of repeating the same correction.

The user may be allowed to choose whether to apply this processing. That is, when there are locations where a rewriting rule in the editing information is applicable, the rule need not be applied immediately and output to the editing result display unit 12. Instead, the editing result display unit 12 may show the user that each rewriting rule is applicable at the corresponding location; the user then inputs to the speech synthesis symbol editing unit 13 an instruction on whether to actually apply it, and the applied result is updated and displayed on the editing result display unit 12 only for the locations where application was instructed.
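The recording formats and the reproduction step described in this section can be sketched as below. Everything named here is an assumption for illustration: the `EditRecord` fields, the `(character, reading)` pair layout of the aligned input, the sample alignment itself, and the `accept` callback that stands in for the user's apply-or-not decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EditRecord:
    """One rewriting rule. Fields set to None impose no condition."""
    char: str
    new_reading: str
    old_reading: Optional[str] = None  # None: simplified form, any current reading matches
    prev_char: Optional[str] = None    # "" means the preceding slot is an empty character
    next_char: Optional[str] = None


def find_applicable(aligned, records):
    """Return (position, record) pairs where a record's conditions all match."""
    hits = []
    for i, (ch, reading) in enumerate(aligned):
        prev_ch = aligned[i - 1][0] if i > 0 else None
        next_ch = aligned[i + 1][0] if i + 1 < len(aligned) else None
        for rec in records:
            if rec.char != ch:
                continue
            if rec.old_reading is not None and rec.old_reading != reading:
                continue
            if rec.prev_char is not None and rec.prev_char != prev_ch:
                continue
            if rec.next_char is not None and rec.next_char != next_ch:
                continue
            hits.append((i, rec))
            break  # at most one rule per position in this sketch
    return hits


def apply_records(aligned, records, accept=lambda position, record: True):
    """Apply matching rules; `accept` models the optional user confirmation."""
    result = list(aligned)
    for i, rec in find_applicable(aligned, records):
        if accept(i, rec):
            result[i] = (result[i][0], rec.new_reading)
    return result


if __name__ == "__main__":
    # Hypothetical alignment of the example input; "" is the empty character.
    aligned = [("今", "キョ'"), ("日", "ー"), ("は", "ワ"), ("", "｜"),
               ("良", "ヨ'"), ("い", "イ"), ("", "／"), ("日", "ヒ"),
               ("和", "ヨリ"), ("で", "デ'"), ("す", "ス"), ("ね", "ネ"), ("。", "．")]
    rule = EditRecord(char="良", old_reading="ヨ'", new_reading="イ'",
                      prev_char="", next_char="い")
    print(apply_records(aligned, [rule]))
```

Passing an `accept` function that returns `False` for some positions corresponds to the variant in which applicable rules are only suggested and the user decides where to apply them.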

As described above, the speech synthesis symbol editing device 10 of each embodiment can present to the user the correspondence between each character of the text and the speech synthesis symbol tokens, regardless of the text analysis method of the target TTS system (the text analysis unit 21 in FIG. 2). The user can thus refer to the text to locate the portion of the speech synthesis symbol string that needs correction, and can easily correct misreadings in the synthesized speech. In addition, the TTS system and the speech synthesis symbol editing device 10 become more independent of each other, so user convenience can be maintained without modifying the speech synthesis symbol editing device 10 every time the TTS system is modified.

That is, because the correspondence estimation unit 11 in the speech synthesis symbol editing device 10 independently generates the correspondence between the characters of the text and the reading symbols of the speech synthesis symbol string, modifications to the text analysis unit 21 in the TTS system can be kept from affecting the speech synthesis symbol editing system while user convenience is maintained.

Various supplementary, additional, and alternative examples are described below.

<1> Exception handling in the correspondence estimation unit 11
In the correspondence estimation unit 11, a sequence of multiple characters may exceptionally be treated as a single character of the text. For example, for a jukujikun (a multi-kanji word read as a whole) such as "今日" (キョ'ー), the method described above still runs without failure by assigning, for convenience, "今" = "キョ'" and "日" = "ー", but displaying such a correspondence feels unnatural to the user. This problem can be avoided by exceptionally treating "今日" as a single character.

Also, if the correspondence estimation unit 11 sets a large penalty for assigning no reading symbol to a character in the text, as described earlier, the estimation of the correspondence breaks down in a case such as "百舌鳥" (モ'ズ), and at least the neighboring characters in the text are also affected. Treating "百舌鳥" as a single character avoids this problem. Because only a limited number of words require such handling, it suffices to decide in advance, based on experience, which character strings are to be treated as a single character.

Note that the speech synthesis symbol editing unit 13 does not assume modifications to the characters of the text, other than manipulation of the empty character "" accompanying insertion or deletion of prosodic boundary symbols, so this method causes no problems.
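The exception handling described above can be sketched as a text splitter that consults a hand-curated list before falling back to single characters. The list contents beyond the two words named in the description, the function name `split_units`, and the longest-match-first policy are assumptions for illustration.

```python
# Hypothetical, hand-curated list of strings to treat as one "character".
EXCEPTIONS = ("今日", "百舌鳥")


def split_units(text, exceptions=EXCEPTIONS):
    """Split text into character units, keeping listed strings together."""
    ordered = sorted(exceptions, key=len, reverse=True)  # prefer the longest match
    units = []
    i = 0
    while i < len(text):
        for word in ordered:
            if text.startswith(word, i):
                units.append(word)
                i += len(word)
                break
        else:
            # No exception matched at this position: emit a single character.
            units.append(text[i])
            i += 1
    return units


if __name__ == "__main__":
    print(split_units("今日は良い日和ですね。"))
```

With this split, "今日" receives the whole reading "キョ'ー" as one unit instead of the artificial "今" = "キョ'", "日" = "ー" assignment.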

As an implementation measure, when the correspondence estimation unit 11 divides the text into characters, an empty character "" may always be inserted between the original characters. In an implementation that always inserts an empty character "" between the characters of the original text, the speech synthesis symbol editing unit 13 no longer needs to insert or delete empty characters "" when prosodic boundary symbols are added or deleted.
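A minimal sketch of this implementation measure follows; the function name `split_with_empty_slots` is an assumption. Each empty string in the result is a permanent slot that can hold zero or more prosodic boundary symbols.

```python
def split_with_empty_slots(text):
    """Split text into characters with an empty character between each pair."""
    units = []
    for i, ch in enumerate(text):
        if i > 0:
            units.append("")  # always-present slot for prosodic boundary symbols
        units.append(ch)
    return units


if __name__ == "__main__":
    print(split_with_empty_slots("良い天気"))
```

Because the slots always exist, adding or removing a boundary symbol becomes an ordinary token edit on an existing slot, at the cost of a longer unit sequence.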

<2> Regarding the speech synthesis symbol correction unit 14, the description above assumed that the correction result is not reflected in the editing buffer held by the editing interface unit 123, but it may be reflected. That is, a result corrected in the same way as the output of the speech synthesis symbol correction unit 14 may be displayed on the editing result display unit 12 and reflected as an automatic correction result.

As noted above, when the correction is not reflected, a formally invalid speech synthesis symbol string is displayed on the editing result display unit 12. This has the advantage that further editing may go more smoothly, but displaying a speech synthesis symbol string that cannot be synthesized as is may also impair user convenience. As a variation, displaying the corrected result avoids this problem.

<3> Instead of, or in addition to, displaying the speech synthesis symbol tokens as character information, the editing result display unit 12 may, for example, graphically display the pitch pattern corresponding to the speech synthesis symbol tokens.

<4> In addition to using the speech synthesis symbol editing device 10 on its own as shown in FIG. 1, a text analysis unit 21 may be placed upstream of the speech synthesis symbol editing device 10 and/or a speech synthesis unit 22 may be placed downstream of it, as described with reference to FIG. 2. In FIG. 2 the combination is called the synthetic speech editing device 20 for convenience, but the text analysis unit 21 and/or the speech synthesis unit 22 may instead be treated as part of the speech synthesis symbol editing device 10.

<5> FIG. 4 is a diagram showing an example of the hardware configuration of a general computer device 70. The speech synthesis symbol editing device 10 (or the synthetic speech editing device 20; the same applies below) can be implemented as one or more computer devices 70 having such a configuration. When the speech synthesis symbol editing device 10 is implemented with two or more computer devices 70, the information needed for processing may be exchanged over a network. The computer device 70 includes: a CPU (central processing unit) 71 that executes predetermined instructions; a GPU (graphics processing unit) 72 as a dedicated processor that executes some or all of the instructions of the CPU 71 in place of, or in cooperation with, the CPU 71; a RAM 73 as a main storage device that provides a work area for the CPU 71 (and the GPU 72); a ROM 74 as an auxiliary storage device; a communication interface 75; a display 76; an input interface 77 that accepts user input via a mouse, keyboard, touch panel, or the like; a microphone 78; and a bus BS for exchanging data among them.

Each functional unit of the speech synthesis symbol editing device 10 can be realized by the CPU 71 and/or the GPU 72 reading from the ROM 74 and executing a predetermined program corresponding to the function of that unit. The CPU 71 and the GPU 72 are both types of arithmetic unit (processor). When display-related processing is performed, the display 76 also operates in conjunction; when communication-related processing for data transmission and reception is performed, the communication interface 75 also operates in conjunction. Processing results of the speech synthesis symbol editing device 10 (such as the display by the editing result display unit 12) may be displayed on the display 76. The synthesized speech obtained by the speech synthesis unit 22 may be output to the user by being played back through the microphone 78.

<6> The speech synthesis symbol editing device 10 according to each embodiment of the present invention makes it possible to streamline editing work for obtaining high-quality synthesized speech (consisting of natural utterances) from text. By preparing high-quality synthesized speech, easy-to-understand and smooth access to information can be provided at lower cost, both to people with disabilities who cannot access information by reading text visually and in environments that impose such constraints. This can contribute to Goal 10 of the UN-led Sustainable Development Goals (SDGs), "Reduce inequality within and among countries".

10…speech synthesis symbol editing device, 11…correspondence estimation unit, 12…editing result display unit, 13…speech synthesis symbol editing unit, 123…editing interface unit, 14…speech synthesis symbol correction unit, 15…editing record database, 16…editing reproduction unit

Claims

1. A speech synthesis symbol editing device that receives as input a pair of a text and a speech synthesis symbol string corresponding to the text and outputs a speech synthesis symbol string, the device comprising:
a correspondence estimation unit that divides the input text into predetermined character units, divides the input speech synthesis symbol string into predetermined speech synthesis symbol tokens, and assigns, based on a predetermined criterion, a correspondence between each of the divided characters and zero or more of the speech synthesis symbol tokens; and
an editing interface unit that displays each character and each of the zero or more speech synthesis symbol tokens to which the correspondence has been assigned, while accepting user editing operations and updating the speech synthesis symbol tokens accordingly.

2. The speech synthesis symbol editing device according to claim 1, wherein the user editing operation is modification, addition, or deletion of a speech synthesis symbol token.

3. The speech synthesis symbol editing device according to claim 1, further comprising a speech synthesis symbol correction unit that, for a substring designated by the user within the string of speech synthesis symbol tokens updated and displayed by the editing interface unit, outputs, when the substring contains a portion that violates a predetermined rule for speech synthesis, a string of speech synthesis symbol tokens corrected so as to conform to the predetermined rule.

4. The speech synthesis symbol editing device according to claim 3, wherein the editing interface unit holds the string of speech synthesis symbol tokens without reflecting the correction by the speech synthesis symbol correction unit.

5. The speech synthesis symbol editing device according to claim 3, wherein the editing interface unit holds the string of speech synthesis symbol tokens with the correction by the speech synthesis symbol correction unit reflected.

6. The speech synthesis symbol editing device according to any one of claims 1 to 5, further comprising an editing record database that stores, as editing information, user editing operations accepted by the editing interface unit in the form of rewriting rules for speech synthesis symbol tokens, each rule applying to a pair of a character and speech synthesis symbol tokens.

7. The speech synthesis symbol editing device according to claim 6, further comprising an editing reproduction unit that refers to the editing record database for each character and each string of zero or more speech synthesis symbol tokens to which the correspondence estimation unit has assigned the correspondence, and, for each location where a rewriting rule is applicable, outputs the result to the editing interface unit either with the rewriting rule applied or in a state indicating that the rewriting rule is applicable.

8. The speech synthesis symbol editing device according to claim 1, wherein the correspondence estimation unit has a mechanism for generating possible readings for each character of a predetermined character set, and the predetermined criterion in the correspondence estimation unit is a criterion for associating each character with one or more speech synthesis symbol tokens in the speech synthesis symbol string that correspond to the generated readings of that character.

9. The speech synthesis symbol editing device according to claim 8, wherein the predetermined character set includes a set of hiragana and katakana, and the mechanism for generating possible readings allows for cases where there are no possible readings for kanji characters.

10. The speech synthesis symbol editing device according to claim 8, wherein the correspondence estimation unit has a mechanism for treating a predetermined character string in the text as a single character and associating it with one or more speech synthesis symbol tokens.

11. The speech synthesis symbol editing device according to claim 1, further comprising a text analysis unit that analyzes the text and outputs a corresponding speech synthesis symbol string,
wherein the correspondence estimation unit receives the speech synthesis symbol string output from the text analysis unit as the input corresponding to the text.

12. The speech synthesis symbol editing device according to any one of claims 1 to 11, further comprising a speech synthesis unit that performs speech synthesis on a substring designated by the user within the string of speech synthesis symbol tokens updated and displayed by the editing interface unit, and outputs the corresponding synthesized speech.

13. A speech synthesis symbol editing method that receives as input a pair of a text and a speech synthesis symbol string corresponding to the text and outputs a speech synthesis symbol string, the method comprising:
a correspondence estimation step of dividing the input text into predetermined character units, dividing the input speech synthesis symbol string into predetermined speech synthesis symbol tokens, and assigning, based on a predetermined criterion, a correspondence between each of the divided characters and zero or more of the speech synthesis symbol tokens; and
an editing interface step of displaying each character and each of the zero or more speech synthesis symbol tokens to which the correspondence has been assigned, while accepting user editing operations and updating the speech synthesis symbol tokens accordingly.

14. A speech synthesis symbol editing program that causes a computer to function as a speech synthesis symbol editing device that receives as input a pair of a text and a speech synthesis symbol string corresponding to the text and outputs a speech synthesis symbol string, the device comprising:
a correspondence estimation unit that divides the input text into predetermined character units, divides the input speech synthesis symbol string into predetermined speech synthesis symbol tokens, and assigns, based on a predetermined criterion, a correspondence between each of the divided characters and zero or more of the speech synthesis symbol tokens; and
an editing interface unit that displays each character and each of the zero or more speech synthesis symbol tokens to which the correspondence has been assigned, while accepting user editing operations and updating the speech synthesis symbol tokens accordingly.