JP2008180750A

JP2008180750A - Voice labeling support system

Info

Publication number: JP2008180750A
Application number: JP2007012157A
Authority: JP
Inventors: Tsutomu Kaneyasu; 勉兼安
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2007-01-23
Filing date: 2007-01-23
Publication date: 2008-08-07
Anticipated expiration: 2027-01-23
Also published as: JP4894533B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice labeling support system capable of keeping the quality of labeling work at a uniform level without depending on the labeler's know-how.
SOLUTION: A label image server 200 includes a search section 203 which receives a search label string and searches a label image DB (database) 204a for the corresponding label image. An operation terminal 300 includes a search request section 301, which requests the search section 203 to search for a label image, and a display section 302. The search request section 301 transmits the label string of the speech to be labeled to the label image server 200. The search section 203 receives the label string from the search request section 301, searches the label image DB 204a for the entry corresponding to the label string, and returns the label image associated with the label string to the operation terminal 300. The search request section 301 causes the display section 302 to display the label image on the screen.
COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a system that supports the labeling work performed for speech synthesis by a corpus-based speech synthesis method.

When speech synthesis is performed by a corpus-based speech synthesis method, a speech database is constructed in advance from a set of speech segments taken from a speaker reading arbitrary words and sentences aloud. When speech synthesis is executed, suitable speech segments are selected from this speech database, and the final synthesized speech is obtained by waveform concatenation.
Thus, in the corpus-based speech synthesis method, the quality of the speech segments affects the quality of the final synthesized speech, so it is important to obtain speech segments of good quality.

One method of obtaining speech segments is to record a speaker's speech in advance and then, while referring to the speech waveform and the actual speech, perform the work of placing delimiters at positions in the waveform that are suitable as speech segment boundaries (labeling).
This labeling work is broadly divided into manual labeling, performed by hand by an experienced worker, and automatic labeling, performed by a computer or the like.

As a technique aimed at "reducing the boundary error of automatic labeling," the following has been proposed (Patent Document 1): "The input speech signal is divided, frame by frame, into a plurality of mel-frequency bands (S1); the power of each band and the speech signal energy of each frame are obtained, and an acoustic feature vector containing them is generated (S2); HMMs (hidden Markov models) for each phoneme or phoneme boundary are created in advance using acoustic feature vectors of this kind, and a calculation is performed so that the likelihood between the HMM sequence corresponding to the known phonemes or phoneme boundaries in the input speech signal and the previously obtained feature vector sequence is maximized (S3); information (a label) representing the phoneme or phoneme boundary is then assigned to each frame of the speech signal (S4)."
It has also been reported that label errors in manual labeling are small among labelers with sufficient experience (2 to 8 years) (Non-Patent Document 1).

Patent Document 1: JP 2004-77901 A (abstract)
Non-Patent Document 1: Hisashi Kawai, Tomoki Toda, "Evaluation of automatic phoneme segmentation for waveform-concatenation speech synthesis", IEICE Transactions, Vol. SP2002-170, pp. 5-10, 01.2003 (accepted)

Although the accuracy of labeling by automatic labeling techniques such as the one described above has improved, there are still cases where the accuracy is insufficient in places. In such cases, manual labeling is performed by experienced labelers.
Manual labeling is generally very time-consuming, so several labelers sometimes share the work. Because labeling depends heavily on each person's know-how, the accuracy of the label positions then becomes inconsistent.
A decrease in the accuracy of label positions lowers the quality of the speech segments and ultimately affects the quality of the synthesized speech.
A voice labeling support system that can keep the quality of labeling work at a uniform level without relying on the know-how of the labeler has therefore been desired.

A voice labeling support system according to the present invention comprises:
a label image server including storage means that stores a label image DB holding a label image for each label string of a phoneme environment; and
a work terminal for performing labeling work,
wherein the label image server includes a search unit that receives a search label string and searches the label image DB for the corresponding label image,
the work terminal includes a search request unit that requests the search unit to search for a label image, and a display unit for displaying the label image on a screen,
the search request unit transmits the label string of the speech to be labeled to the label image server,
the search unit receives the label string from the search request unit, searches the label image DB for the entry corresponding to that label string, and returns the label image associated with the label string to the work terminal, and
the search request unit receives the label image from the search unit and causes the display unit to display the label image on the screen.

According to the voice labeling support system of the present invention, the label images held in the label image DB can be displayed on the work terminal and consulted during labeling work, so the accuracy of label positions can be raised uniformly regardless of differences in know-how between labelers.
As a result, the quality of the speech DB used in corpus-based speech synthesis, and of the synthesized speech generated from it, can also be improved.

Embodiment 1.
FIG. 1 shows the configuration of a voice labeling support system according to Embodiment 1 of the present invention.
The voice labeling support system of FIG. 1 includes a label image registration terminal 100, a label image server 200, and a labeling work terminal 300. These are connected via a network 400.

The label image registration terminal 100 is a terminal for registering label images in the label image server 200, and includes a label string transmission unit 101 and a label image transmission unit 102.
The label string transmission unit 101 transmits a phoneme label string to the label image server 200 and requests the server to check whether a label image corresponding to that phoneme label string has already been registered.
The label image transmission unit 102 transmits a label image to the label image server 200 and requests the server to register it.

The label image server 200 is a server that holds label images and, on receiving a label image acquisition request from the labeling work terminal 300, returns the corresponding label image. It includes a label registration determination unit 201, a label image registration unit 202, a label image search unit 203, and storage means 204.
The label registration determination unit 201 receives a phoneme label string, determines whether a label image corresponding to that phoneme label string is held in the label image DB (DB is short for database; the same applies hereinafter) 204a described later, and returns the result.
The label image registration unit 202 receives a label image and stores it in the label image DB 204a.
The label image search unit 203 receives a phoneme label string, searches the label image DB 204a for the label image corresponding to that phoneme label string, and returns it.
The storage means 204 stores the label image DB 204a, which is described later with reference to FIG. 3.

The labeling work terminal 300 is a terminal for performing labeling work, and includes a label image search request unit 301 and a display unit 302.
The label image search request unit 301 transmits a phoneme label string to the label image server 200 and requests the server to search the label image DB 204a for the label image corresponding to that phoneme label string and return it.
The display unit 302 displays the label image that the label image search request unit 301 has acquired from the label image server 200, using a screen layout such as the one described later with reference to FIG. 5.

The label string transmission unit 101, the label image transmission unit 102, the label registration determination unit 201, the label image registration unit 202, the label image search unit 203, and the label image search request unit 301 can be implemented in hardware, such as circuit devices that perform these functions, or as software running on a processing device such as a microcomputer or CPU.
The storage means 204 is preferably a storage device with a relatively large capacity, such as an HDD (Hard Disk Drive).
The display unit 302 can be composed of a screen display device, such as a display, and a control function such as the driver software that controls it.
Each terminal and the server are assumed to have the necessary network interfaces.

Before describing the operation of the voice labeling support system of FIG. 1, some supplementary explanation of its configuration is given here.

First, an example of a "phoneme (environment) label string" as used in the present invention is shown.
For example, the label string for the text "おはよう" ("ohayou", good morning) is that text written in phoneme symbols, and can be expressed as "o-h+a", "h-a+y", "a-y+lo", "y-lo+slt".
Here, "-" and "+" indicate the connections to the preceding and following phonemes, "lo" is the long vowel of "o", and "slt" is the trailing silence.
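The notation above can be illustrated with a short Python sketch. The function name `context_labels` and the list-of-phonemes input are assumptions made for illustration; the source gives only the resulting label strings.

```python
def context_labels(phonemes):
    """Return one 'left-center+right' label for each interior phoneme.

    Hypothetical helper: the first and last phonemes appear only as
    context, which matches the four labels listed for "ohayou".
    """
    return [
        f"{left}-{center}+{right}"
        for left, center, right in zip(phonemes, phonemes[1:], phonemes[2:])
    ]


# "ohayou" as phonemes: 'lo' is the long vowel of 'o', 'slt' the trailing silence.
labels = context_labels(["o", "h", "a", "y", "lo", "slt"])
print(labels)  # ['o-h+a', 'h-a+y', 'a-y+lo', 'y-lo+slt']
```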

FIG. 2 shows an example of a label image.
A label image in Embodiment 1 is obtained by dividing, at the phoneme boundaries, the speech waveform produced when a speaker utters a certain text, and recording the waveform together with the phoneme boundaries as a picture. The picture is stored in a format that the display unit 302 of the labeling work terminal 300, described later, can display (for example, a standard format such as JPEG or bitmap).
FIG. 2 shows the waveform produced when the label string "i-xsh+lu" of a certain phoneme environment is uttered, with vertical lines marking the boundaries that divide it into the phonemes "i", "xsh", and "lu".

In manual labeling, phoneme boundaries are determined in this way, but which position is chosen as a boundary depends on the individual labeler's know-how.
Accordingly, consider acquiring and accumulating label images such as the one in FIG. 2 when a skilled labeler performs labeling work, so that other labelers can consult them during their own labeling work.
The present invention is based on this idea, and the label image server 200 provides the accumulation function described above.

FIG. 3 shows the structure of the label image DB 204a and example data.
The label image DB 204a has a "label string" column and a "label image" column.
The "label string" column stores a phoneme label string.
The "label image" column stores the label image corresponding to the phoneme label string identified by the value of the "label string" column. The label image stored in this column is picture data showing the waveform together with the label positions, as described for FIG. 2.
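As an illustration of this two-column structure, the following is a minimal sketch using Python's built-in `sqlite3` module. The choice of SQLite, the table and column names (`label_image`, `label_string`, `image`), and the use of the label string as primary key are all assumptions; the source specifies only that each entry pairs a label string with a label image.

```python
import sqlite3


def create_label_image_db(conn):
    # Two columns as in FIG. 3: the phoneme label string and its image data.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS label_image ("
        " label_string TEXT PRIMARY KEY,"  # assumption: one image per label string
        " image BLOB NOT NULL)"
    )


def store_label_image(conn, label_string, image_bytes):
    conn.execute(
        "INSERT OR REPLACE INTO label_image (label_string, image) VALUES (?, ?)",
        (label_string, image_bytes),
    )


def fetch_label_image(conn, label_string):
    row = conn.execute(
        "SELECT image FROM label_image WHERE label_string = ?",
        (label_string,),
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
create_label_image_db(conn)
# Placeholder bytes stand in for real JPEG or bitmap data.
store_label_image(conn, "i-xsh+lu", b"placeholder image bytes")
print(fetch_label_image(conn, "i-xsh+lu") is not None)
```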

Next, the operation of the voice labeling support system in Embodiment 1 will be described.

FIG. 4 illustrates the processing sequence when the label image registration terminal 100 registers a label image. The example used here is the registration of the label image corresponding to the phoneme label string "k-a+i".
Each step is described below.

(0) Preliminary work
It is assumed here that the operator of the label image registration terminal 100 has sufficient labeling experience and has, as preliminary work, already labeled the phoneme label string "k-a+i" on that terminal.

(1) Transmit the label string "k-a+i"
The operator operates the operation unit (not shown) of the label image registration terminal 100 to request the label image server 200 to determine whether a label image corresponding to the phoneme label string "k-a+i" has already been registered.
On receiving this operation instruction, the label string transmission unit 101 transmits the phoneme label string "k-a+i" to the label image server 200 and requests it to search whether a label image corresponding to "k-a+i" is registered in the label image DB 204a.
The search request reaches the label image server 200 as a packet transmitted over the network 400.

(2) Search for the label string "k-a+i"
The label registration determination unit 201 receives the request packet transmitted by the label string transmission unit 101 in step (1).
Next, the label registration determination unit 201 searches the label image DB 204a using the received label string "k-a+i" as the key and determines whether the corresponding label image is registered.
Here, it is assumed that the search finds no corresponding label image and the unit determines that it is not registered.

(3) Search result for the label string
The label registration determination unit 201 returns the search result for the label string "k-a+i" to the label image registration terminal 100.

(4) Transmission request
The label string transmission unit 101 requests the label image transmission unit 102 to transmit the label image corresponding to the phoneme label string "k-a+i" to the label image server 200.

(5) Transmit the label image for "k-a+i"
The label image transmission unit 102 transmits to the label image server 200 the label image corresponding to the phoneme label string "k-a+i", which the operator labeled as preliminary work.
The label image reaches the label image server 200 as a packet transmitted over the network 400.

(6) Register the label image for "k-a+i"
The label image registration unit 202 receives the label image packet transmitted by the label image transmission unit 102 in step (5).
Next, the label image registration unit 202 registers the received label image in the label image DB 204a. "Registering" here means adding a new entry with the structure described for FIG. 3.
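Steps (1) to (6) can be sketched as follows. This is a simplified illustration under stated assumptions: the network is replaced by direct method calls, the DB is an in-memory dictionary, and the class and function names are hypothetical. The source describes only the case where the label string is not yet registered; skipping the upload when it is already registered is an assumption of this sketch.

```python
class LabelImageServer:
    """Hypothetical stand-in for the label image server 200."""

    def __init__(self):
        # Stands in for the label image DB 204a: label string -> image bytes.
        self._db = {}

    def is_registered(self, label_string):
        # Role of the label registration determination unit 201.
        return label_string in self._db

    def register(self, label_string, image):
        # Role of the label image registration unit 202: add a new entry.
        self._db[label_string] = image


def register_label_image(server, label_string, image):
    """Run the registration sequence; return True if a new entry was added."""
    # Steps (1)-(3): send the label string and receive the check result.
    if server.is_registered(label_string):
        return False  # assumption: nothing is sent if an image already exists
    # Steps (4)-(6): send the label image and have the server store it.
    server.register(label_string, image)
    return True


server = LabelImageServer()
print(register_label_image(server, "k-a+i", b"placeholder image bytes"))  # True
```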

Through the above processing, the label image corresponding to the phoneme label string "k-a+i" has been registered in the label image DB 204a.
Next, the procedure for using the label images registered in the label image DB 204a will be described.

FIG. 5 is an example layout of the labeling work screen displayed on the display unit 302 of the labeling work terminal 300. The work performed by the operator of the labeling work terminal 300 is described below with reference to FIG. 5.

(1) Load the speech waveform data
The operator operates the operation unit (not shown) to instruct the labeling work terminal 300 to load the speech waveform data to be labeled.
The waveform corresponding to the loaded speech waveform data is displayed in area "1" of FIG. 5.

(2) Enlarge the labeling location
The operator instructs the labeling work terminal 300 to enlarge the speech waveform at the location to be labeled.
The enlarged waveform of the specified location is displayed in area "2" of FIG. 5.

(3) Transmit the label string
The operator enters a phoneme label string in area "5" of FIG. 5 and presses the "Send" button.

(4) Acquire and display the label image
Through the processing of FIG. 6 described later, the labeling work terminal 300 acquires from the label image server 200 the label image corresponding to the phoneme label string that the operator entered in step (3), and displays it in area "4" of FIG. 5.

(5) Set the label position
While referring to the label image displayed in area "4" of FIG. 5, the operator moves the element shown at "3" by operating the operation unit (not shown). The resulting position is set as the label position.

In this way, the operator can perform labeling work while consulting, as a kind of work manual, the label image that corresponds to (or is closest to) the phoneme label string being labeled, so the accuracy of label positions can be both unified and improved.

FIG. 6 illustrates the internal processing sequence executed by the labeling work terminal 300 when its operator performs labeling work on a screen such as the one described for FIG. 5.
Each step is described below.

(1) Transmit the label string "k-a+i"
The operator operates the operation unit (not shown) of the labeling work terminal 300 to request the label image server 200 to send the label image corresponding to the phoneme label string "k-a+i".
On receiving this operation instruction, the label image search request unit 301 transmits the phoneme label string "k-a+i" to the label image server 200 and requests it to search for the corresponding label image.
The search request reaches the label image server 200 as a packet transmitted over the network 400.
This step corresponds to the internal operation of the labeling work terminal 300 in step (3) described for FIG. 5.

(2) Search for the label string "k-a+i"
The label image search unit 203 receives the request packet transmitted by the label image search request unit 301 in step (1).
Next, the label image search unit 203 searches the label image DB 204a using the received label string "k-a+i" as the key and acquires the corresponding label image.
Here, it is assumed that the corresponding label image has already been registered in the label image DB 204a.

(3) Transmit the label image for "k-a+i"
The label image search unit 203 transmits the label image acquired in step (2) to the labeling work terminal 300.
The label image reaches the labeling work terminal 300 as a packet transmitted over the network 400.

(4) Display request
The label image search request unit 301 requests the display unit 302 to display the acquired label image on the screen.
The display unit 302 displays the label image acquired by the label image search request unit 301.
This step corresponds to the internal operation of the labeling work terminal 300 in step (4) described for FIG. 5.
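The search-and-display sequence of FIG. 6 can be sketched in the same simplified style. The function names, the dictionary standing in for the label image DB 204a, and the `display` callback standing in for the display unit 302 are assumptions made for illustration; packet transport over the network 400 is omitted.

```python
def search_label_image(label_image_db, label_string):
    # Role of the label image search unit 203: look up by label string.
    return label_image_db.get(label_string)


def request_and_display(label_image_db, label_string, display):
    """Role of the label image search request unit 301."""
    # Steps (1)-(3): send the label string and receive the label image.
    image = search_label_image(label_image_db, label_string)
    # Step (4): hand the image to the display unit if one was returned.
    if image is not None:
        display(image)
    return image


label_image_db = {"k-a+i": b"placeholder image bytes"}
displayed = []  # records what the hypothetical display unit was asked to show
request_and_display(label_image_db, "k-a+i", displayed.append)
print(len(displayed))  # 1
```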

As described above, according to Embodiment 1, the label images held in the label image DB can be displayed on the work terminal and consulted during labeling work, so the accuracy of label positions can be raised uniformly regardless of differences in know-how between labelers.
As a result, the quality of the speech DB used in corpus-based speech synthesis, and of the synthesized speech generated from it, can also be improved.

Embodiment 2.
FIG. 7 shows the configuration of a voice labeling support system according to Embodiment 2 of the present invention.
The label image server 200 in the voice labeling support system of FIG. 7 has default label image storage means 205 in addition to the configuration of FIG. 1. The rest of the configuration is the same as in FIG. 1, so its description is omitted.
The default label image storage means 205 is preferably a storage device with a relatively large capacity, such as an HDD (Hard Disk Drive).

The default label image storage means 205 stores a default label image DB 205a.
The structure of the default label image DB 205a is the same as that of the label image DB 204a described for FIG. 3. The difference is that the label images held in the label image DB 204a are sent from the label image registration terminal 100, whereas the label images held in the default label image DB 205a are generated and stored in advance from the utterances of a prescribed speaker.
Although FIG. 7 shows the storage means 204 and the default label image storage means 205 as separate components, they may be combined into a single storage means that stores both DBs. The same applies to the embodiments described later.

The default label image DB 205a corresponds to the "prescribed label image DB" in Embodiment 2.

FIG. 8 illustrates the internal processing sequence executed by the labeling work terminal 300 in Embodiment 2 when its operator performs labeling work on a screen such as the one described for FIG. 5.
The processing in each step is largely the same as described for FIG. 6, except for steps (2) and (3), which are described below.

(2) Search for the label string "k-a+i"
The label image search unit 203 receives the request packet transmitted by the label image search request unit 301 in step (1).
Next, the label image search unit 203 searches the label image DB 204a using the received label string "k-a+i" as the key and attempts to acquire the corresponding label image.
Here, it is assumed that the corresponding label image has not been registered in the label image DB 204a.
In this case, the label image search unit 203 searches the label images held in the default label image DB 205a for the one whose phoneme label string is identical to "k-a+i" or closest to it, and acquires that label image.

(3) Transmit the label image closest to "k-a+i"
The label image search unit 203 transmits the label image acquired in step (2) to the labeling work terminal 300.
The label image reaches the labeling work terminal 300 as a packet transmitted over the network 400.
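The fallback in steps (2) and (3) can be sketched as below. The source does not define how the "closest" label string is chosen, so the `similarity` score used here (a match on the center phoneme counts double, a match on either neighbor counts once) is purely a hypothetical stand-in, as are the function names and the dictionary-based DBs.

```python
import re

# Matches the 'left-center+right' notation, e.g. "k-a+i".
_LABEL = re.compile(r"^([^-+]+)-([^-+]+)\+([^-+]+)$")


def parse_label(label_string):
    match = _LABEL.match(label_string)
    if match is None:
        raise ValueError(f"not a left-center+right label: {label_string!r}")
    return match.groups()


def similarity(a, b):
    # Hypothetical metric; the source only says "identical or closest".
    left_a, center_a, right_a = parse_label(a)
    left_b, center_b, right_b = parse_label(b)
    return 2 * (center_a == center_b) + (left_a == left_b) + (right_a == right_b)


def find_label_image(label_string, label_image_db, default_db):
    """Return (matched label string, image), or None if both DBs are empty."""
    # Try the label image DB 204a first.
    if label_string in label_image_db:
        return label_string, label_image_db[label_string]
    # Otherwise fall back to the default label image DB 205a.
    if not default_db:
        return None
    best = max(default_db, key=lambda candidate: similarity(label_string, candidate))
    return best, default_db[best]


default_db = {"k-a+u": b"image A", "s-o+i": b"image B"}
print(find_label_image("k-a+i", {}, default_db)[0])  # k-a+u
```

An identical label string in the default DB scores highest under this metric, so the exact-match case needs no separate branch.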

As described above, according to Embodiment 2, even when the label image DB 204a holds no matching label image, building the default label image DB 205a in advance from the utterances of a standard speaker avoids the situation where no label image at all is available for reference during labeling work.
For example, even when labeling work has to be carried out before enough label images have been registered in the label image DB 204a from the label image registration terminal 100, a label image based on the utterances of a standard speaker is at least available, so the accuracy of the labeling work can be kept at a certain level.

Embodiment 3.
The third embodiment of the present invention describes a configuration example in which the label image server 200 stores label images generated from the utterances of a plurality of speakers.
The configuration of the voice labeling support system according to the third embodiment is the same as that described in the second embodiment, except for the configurations of the label image DB 204a and the default label image DB 205a, and its description is therefore omitted.

FIG. 9 shows the configuration of the label image DB 204a in the third embodiment, together with example data.
In FIG. 9, a "speaker name" column is added to the configuration described with reference to FIG. 3.
The "speaker name" column stores information that identifies a speaker, such as a name.
The "label string" column stores a phoneme label string.
The "label image" column stores the label image corresponding to the speaker identified by the value in the "speaker name" column and the phoneme label string identified by the value in the "label string" column.

That is, the label image DB 204a in the third embodiment stores label images generated from the utterances of a plurality of speakers, and may store a plurality of label images for the same phoneme label string.
Label images are stored for a plurality of speakers because, even for utterances of the same phoneme label string, the appropriate label positions may differ from speaker to speaker. It is therefore desirable for the label image DB 204a to hold a label image for each speaker of the speech to be labeled, and the data structure shown in FIG. 9 achieves this.
The default label image DB 205a can have the same configuration as that shown in FIG. 9.
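
As a rough illustration of the FIG. 9 layout, one entry can be modeled as a record with the three columns, indexed by the combination of speaker name and label string. The class name, field names, and `build_index` helper below are hypothetical and are not taken from the document.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class LabelImageEntry:
    """One row of a FIG. 9 style table (field names are assumptions)."""
    speaker_name: str
    label_string: str
    label_image: bytes


def build_index(
    entries: Iterable[LabelImageEntry],
) -> Dict[Tuple[str, str], LabelImageEntry]:
    """Index entries by (speaker name, label string)."""
    # The same label string may appear once per speaker, so the key is the pair.
    return {(e.speaker_name, e.label_string): e for e in entries}
```

Keying on the pair keeps the images of speakers "A" and "B" for the same label string as separate entries.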

FIG. 10 illustrates the processing sequence performed when the label image registration terminal 100 registers a label image. The example used here is the sequence for registering the label image corresponding to speaker "A" and the phoneme label string "k-a+i".
Each step is described below.

(0) Preliminary work
Here, it is assumed that the operator of the label image registration terminal 100 has sufficient labeling experience and has, as preliminary work, carried out labeling on that terminal for the phoneme label string "k-a+i" uttered by speaker "A".

(1) Transmit the speaker name "A" and the label string "k-a+i"
The operator operates an operation unit (not shown) of the label image registration terminal 100 and requests the label image server 200 to determine whether a label image corresponding to the phoneme label string "k-a+i" uttered by speaker "A" has already been registered.
In response to this operation, the label string transmission unit 101 transmits the speaker name "A" and the phoneme label string "k-a+i" to the label image server 200 and requests it to search whether a label image corresponding to the phoneme label string "k-a+i" uttered by speaker "A" is registered in the label image DB 204a.
The search request reaches the label image server 200 as packets carried over the network 400.

(2) Search for the label string "k-a+i" under the speaker name "A"
The label registration determination unit 201 receives the request packet transmitted by the label string transmission unit 101 in step (1).
Next, the label registration determination unit 201 searches the label image DB 204a using the received speaker name "A" and label string "k-a+i" as keys, and determines whether the corresponding label image is registered.
The determination depends on whether the combination of the speaker name "A" and the label string "k-a+i" is registered in the label image DB 204a. That is, an entry that matches only one of the two is not regarded as satisfying the search condition.
Here, it is assumed that the search finds no corresponding label image and that the label image is therefore determined not to be registered.

(3) Label string search result
The label registration determination unit 201 returns the search result for the speaker name "A" and the label string "k-a+i" to the label image registration terminal 100.

(4) Transmission request
The label string transmission unit 101 requests the label image transmission unit 102 to transmit, to the label image server 200, the label image corresponding to the phoneme label string "k-a+i" uttered by speaker "A".

(5) Transmit the label image for the speaker name "A" and "k-a+i"
The label image transmission unit 102 transmits to the label image server 200 the label image, corresponding to the phoneme label string "k-a+i" uttered by speaker "A", that the operator produced by labeling as preliminary work.
The label image reaches the label image server 200 as packets carried over the network 400.

(6) Register the label image for "k-a+i"
The label image registration unit 202 receives the label image packet transmitted by the label image transmission unit 102 in step (5).
Next, the label image registration unit 202 registers the received label image in the label image DB 204a. On registration, it creates a new entry whose "speaker name" column in FIG. 9 has the value "A" and whose "label string" column has the value "k-a+i", and stores the received label image in the "label image" column.
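
Steps (1) to (6) above can be sketched as a check followed by a registration. This is a simplified, in-process sketch: the class and function names are hypothetical, network transfer is omitted, and skipping the upload when an entry already exists is an assumption, since the document only walks through the case where the entry is not yet registered.

```python
from typing import Dict, Tuple


class LabelImageServer:
    """Hypothetical stand-in for the label image server 200."""

    def __init__(self) -> None:
        # Maps (speaker name, label string) to a label image, as in FIG. 9.
        self.label_image_db: Dict[Tuple[str, str], bytes] = {}

    def is_registered(self, speaker: str, label: str) -> bool:
        # Step (2): only the combination of both keys counts as a match.
        return (speaker, label) in self.label_image_db

    def register(self, speaker: str, label: str, image: bytes) -> None:
        # Step (6): create a new entry holding the received label image.
        self.label_image_db[(speaker, label)] = image


def register_if_missing(
    server: LabelImageServer, speaker: str, label: str, image: bytes
) -> bool:
    """Terminal-side flow: query first, then send the image if needed."""
    if server.is_registered(speaker, label):  # steps (1) to (3)
        return False
    server.register(speaker, label, image)  # steps (4) to (6)
    return True
```

An entry for speaker "B" with the same label string does not prevent speaker "A" from registering, because the determination uses both keys together.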

FIG. 11 illustrates the internal processing sequence executed by the labeling work terminal 300 in the third embodiment when its operator performs labeling work on a screen such as that described with reference to FIG. 5.
Each step is described below.

(1) Transmit the speaker name "A" and the label string "k-a+i"
The operator operates an operation unit (not shown) of the labeling work terminal 300 and requests the label image server 200 to send the label image corresponding to the phoneme label string "k-a+i" uttered by speaker "A".
In response to this operation, the label image search request unit 301 transmits the speaker name "A" and the phoneme label string "k-a+i" to the label image server 200 and requests it to search for the label image corresponding to the phoneme label string "k-a+i" uttered by speaker "A".
The search request reaches the label image server 200 as packets carried over the network 400.
This step corresponds to the internal operation of the labeling work terminal 300 in step (3) described with reference to FIG. 5. In this case, a field for entering the "speaker name" is provided in part "5" of the screen in FIG. 5.

(2) Search for the label string "k-a+i" under the speaker name "A"
The label image search unit 203 receives the request packet transmitted by the label image search request unit 301 in step (1).
Next, the label image search unit 203 searches the label image DB 204a using the received speaker name "A" and label string "k-a+i" as keys, and acquires the corresponding label image. If the corresponding label image is not registered in the label image DB 204a, the unit searches the label images held in the default label image DB 205a for the one whose speaker name is "A" and whose phoneme label string is identical to, or closest to, "k-a+i", and acquires that label image.

(4) Display request
The label image search request unit 301 requests the display unit 302 to display the acquired label image on the screen.
The display unit 302 displays on the screen the label image acquired by the label image search request unit 301.

As described above, according to the third embodiment, a label image is held for each speaker of the speech to be labeled, so labeling work can be carried out while referring to a label image close to that of the speech actually being labeled. Label positions can therefore be set more accurately and consistently, regardless of differences in know-how among labelers.

Embodiment 4.
The fourth embodiment of the present invention describes a configuration example in which the label image server 200 stores mel-cepstrum information for each label image and uses this information when searching for a label image.
The configuration of the voice labeling support system according to the fourth embodiment is the same as that of FIG. 1 described in the first embodiment, except for the configuration of the label image DB 204a, and its description is therefore omitted.

The "mel cepstrum" is defined as the inverse Fourier transform of the logarithmic power spectrum on the mel scale, a scale that represents the perception of pitch. In general, using the mel cepstrum makes it possible to compress information in a way that matches auditory characteristics.
In the fourth embodiment, mel-cepstrum information is extracted at 5 ms intervals over the waveform data section of the corresponding label string and is averaged over each of the four equal parts into which each label section is divided. These values are design choices, and the designer may set them as appropriate.
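
The averaging step described above can be sketched as follows, assuming the per-frame mel-cepstrum vectors for one label section have already been extracted at 5 ms intervals by some other means. The function name and the requirement of at least one frame per part are assumptions made for this sketch, and the mel-cepstrum extraction itself is not shown.

```python
from typing import List, Sequence


def average_in_equal_parts(
    frames: Sequence[Sequence[float]], parts: int = 4
) -> List[List[float]]:
    """Average per-frame feature vectors over equal parts of one label section."""
    if len(frames) < parts:
        raise ValueError("need at least one frame per part")
    averaged = []
    for p in range(parts):
        # Integer arithmetic splits the frames into near-equal contiguous chunks.
        start = p * len(frames) // parts
        end = (p + 1) * len(frames) // parts
        chunk = frames[start:end]
        dim = len(chunk[0])
        averaged.append(
            [sum(frame[d] for frame in chunk) / len(chunk) for d in range(dim)]
        )
    return averaged
```

For example, eight frames (40 ms of speech at a 5 ms frame interval) are reduced to four averaged vectors, each covering two frames.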

In the fourth embodiment, when the corresponding label image is retrieved from the label image DB 204a, mel-cepstrum information is used as a search condition together with the search label string.
As a result, even when a plurality of entries match the search condition, or when no entry matches it at all, the label image whose mel-cepstrum information is closest can be acquired, so a label image suitable for reference during labeling work on the labeling work terminal 300 can be obtained reliably.

The configuration and operation of the fourth embodiment are described below.

FIG. 12 shows the configuration of the label image DB 204a in the fourth embodiment, together with example data.
In FIG. 12, a "mel cepstrum" column is added to the configuration described with reference to FIG. 3.
The "label string" column stores a phoneme label string.
The "label image" column stores the label image corresponding to the phoneme label string identified by the value in the "label string" column.
The "mel cepstrum" column stores mel-cepstrum information calculated from the waveform data corresponding to the label image stored in the "label image" column. The example shows twelve items of mel-cepstrum information stored for each label image, but the number of items is not limited to this.

FIG. 13 illustrates the processing sequence performed when the label image registration terminal 100 registers a label image. The example used here is the sequence for registering the label image corresponding to the phoneme label string "k-a+i".
Each step is described below.

(0) Preliminary work to (3) Label string search result
Steps (0) to (3) are the same as those described with reference to FIG. 4 in the first embodiment, and their description is therefore omitted.

(4) Transmission request
The label string transmission unit 101 requests the label image transmission unit 102 to transmit, to the label image server 200, the label image corresponding to the phoneme label string "k-a+i" and the mel-cepstrum information obtained from its waveform data.

(5) Transmit the label image and mel-cepstrum information for "k-a+i"
The label image transmission unit 102 transmits to the label image server 200 the label image, corresponding to the phoneme label string "k-a+i", that the operator produced by labeling as preliminary work.
It also obtains the mel-cepstrum information corresponding to that label image from the speech waveform data and transmits it to the label image server 200 together with the label image.
The label image and the mel-cepstrum information reach the label image server 200 as packets carried over the network 400.

(6) Register the label image and mel-cepstrum information for "k-a+i"
The label image registration unit 202 receives the packet transmitted by the label image transmission unit 102 in step (5).
Next, the label image registration unit 202 registers the received label image and mel-cepstrum information in the label image DB 204a.

In FIG. 13, the label image registration terminal 100 obtains the mel-cepstrum information corresponding to the label image to be registered. Alternatively, the waveform data may be transmitted to the label image server 200, and the label image server 200 may obtain and register the mel-cepstrum information.

FIG. 14 illustrates the internal processing sequence executed by the labeling work terminal 300 in the fourth embodiment when its operator performs labeling work on a screen such as that described with reference to FIG. 5.
Each step is described below.

(1) Transmit the label string "k-a+i" and mel-cepstrum information
The operator operates an operation unit (not shown) of the labeling work terminal 300 and requests the label image server 200 to send the label image corresponding to the phoneme label string "k-a+i".
In response to this operation, the label image search request unit 301 obtains mel-cepstrum information from the waveform data corresponding to the phoneme label string "k-a+i".
Next, it transmits the mel-cepstrum information to the label image server 200 together with the phoneme label string "k-a+i" and requests a search for the label image corresponding to the phoneme label string "k-a+i" and that mel-cepstrum information.
The search request reaches the label image server 200 as packets carried over the network 400.

(2) Search using the label string "k-a+i" and mel-cepstrum information
The label image search unit 203 receives the request packet transmitted by the label image search request unit 301 in step (1).
Next, the label image search unit 203 searches the label image DB 204a using the received label string "k-a+i" as a key and acquires the matching entries. Here, it is assumed that a plurality of entries match.
The label image search unit 203 then calculates the distance between the mel-cepstrum information held by each entry obtained in the search keyed on "k-a+i" and the mel-cepstrum information received from the labeling work terminal 300, identifies the entry with the smallest distance, and acquires the label image of that entry.

(3) Transmit the label image for "k-a+i" to (4) Display request
Steps (3) and (4) are the same as steps (3) and (4) described with reference to FIG. 6 in the first embodiment, and their description is therefore omitted.

Step (2) was described for the case where a plurality of entries correspond to the label string "k-a+i". As noted earlier, however, even when no corresponding entry exists at all, the entry whose mel-cepstrum information is closest to that received from the labeling work terminal 300 may be retrieved in the same way.
In addition, although the label string and the mel-cepstrum information are used together as search conditions here, the mel-cepstrum information alone may be used as the search condition.
In either case, using mel-cepstrum information makes it possible to acquire a label image close to the characteristics of the speech being labeled.
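
The search described for FIG. 14, including the two variations above, might be sketched as follows. The document speaks only of a "distance" between mel-cepstrum information, so the Euclidean distance used here is an assumption, as are the tuple layout and the function names.

```python
import math
from typing import List, Optional, Sequence, Tuple

# (label string, label image, mel-cepstrum vector): an assumed FIG. 12 style row.
Entry = Tuple[str, str, Sequence[float]]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed distance measure between two mel-cepstrum vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_label_image(
    db: List[Entry], label: Optional[str], query: Sequence[float]
) -> Optional[str]:
    """Return the label image whose mel-cepstrum information is closest."""
    # First narrow the candidates by label string, as in step (2).
    candidates = [entry for entry in db if entry[0] == label]
    # With no matching entry (or label=None), compare against every entry.
    if not candidates:
        candidates = db
    if not candidates:
        return None
    best = min(candidates, key=lambda entry: euclidean(entry[2], query))
    return best[1]
```

Passing `label=None` corresponds to using the mel-cepstrum information alone as the search condition.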

FIG. 12 of the fourth embodiment was used to describe the configuration of the label image DB 204a, but the default label image DB 205a can have the same configuration.
FIG. 12 illustrates a configuration in which the "mel cepstrum" column is added to the configuration of FIG. 3, but a configuration in which the "mel cepstrum" column is added to the configuration of FIG. 9 provides the same effect of the fourth embodiment.

As described above, according to the fourth embodiment, mel-cepstrum information is held in the label image DB 204a and is used when searching for a label image. Therefore, even when a plurality of entries match the search condition, or when no entry matches it at all, the label image whose mel-cepstrum information is closest can be acquired, and a label image suitable for reference during labeling work can be obtained reliably.

Embodiment 5.
The fifth embodiment of the present invention describes a configuration in which the label image server 200 has a function for calculating mel-cepstrum information. The aim is to concentrate the processing load of label image searches on the label image server 200.
The configuration of the voice labeling support system according to the fifth embodiment is the same as that described in the fourth embodiment, and its description is therefore omitted.

FIG. 15 illustrates the internal processing sequence executed by the labeling work terminal 300 in the fifth embodiment when its operator performs labeling work on a screen such as that described with reference to FIG. 5.
Each step is described below.

(1) Transmit the label string "k-a+i" and waveform data
The operator operates an operation unit (not shown) of the labeling work terminal 300 and requests the label image server 200 to send the label image corresponding to the phoneme label string "k-a+i".
In response to this operation, the label image search request unit 301 transmits the waveform data to the label image server 200 together with the phoneme label string "k-a+i" and requests a search for the label image corresponding to the phoneme label string "k-a+i" and its mel-cepstrum information.
The search request reaches the label image server 200 as packets carried over the network 400.

(2) Obtain mel-cepstrum information
The label image search unit 203 receives the request packet transmitted by the label image search request unit 301 in step (1).
Next, the label image search unit 203 calculates mel-cepstrum information from the received waveform data.

(3) Search using the label string "k-a+i" and mel-cepstrum information
The label image search unit 203 searches the label image DB 204a using the received label string "k-a+i" as a key and acquires the matching entries. Here, it is assumed that a plurality of entries match.
The label image search unit 203 then calculates the distance between the mel-cepstrum information held by each entry obtained in the search keyed on "k-a+i" and the mel-cepstrum information calculated in step (2), identifies the entry with the smallest distance, and acquires the label image of that entry.

(4) Transmit the label image for "k-a+i" to (5) Display request
Steps (4) and (5) are the same as steps (3) and (4) described with reference to FIG. 6 in the first embodiment, and their description is therefore omitted.

Step (2) was described for the case where a plurality of entries correspond to the label string "k-a+i". As in the fourth embodiment, however, even when no corresponding entry exists at all, the entry whose mel-cepstrum information is closest to that received from the labeling work terminal 300 may be retrieved in the same way.
In addition, although the label string and the mel-cepstrum information are used together as search conditions here, the mel-cepstrum information alone may be used as the search condition, as in the fourth embodiment.

As described above, according to the fifth embodiment, the process of obtaining mel-cepstrum information when the labeling work terminal 300 acquires a label image is executed by the label image server 200. The computational load is therefore concentrated on the label image server 200, and the CPU, memory, and other resources of the labeling work terminal 300 can be scaled down.
Concentrating the computational load on the label image server 200 also concentrates the investment in one place, which is preferable from the viewpoint of managing server assets and the like.
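
The division of work in the fifth embodiment can be sketched as a server object that receives raw waveform data and computes the features itself. Everything here is hypothetical: the class and method names are invented for illustration, the distance is assumed to be Euclidean, and `toy_features` is a deliberately trivial placeholder that does NOT compute a mel cepstrum. A real system would plug in an actual mel-cepstrum extractor.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# (label string, label image, stored feature vector)
Entry = Tuple[str, str, Sequence[float]]
FeatureExtractor = Callable[[Sequence[float]], List[float]]


def toy_features(waveform: Sequence[float]) -> List[float]:
    """Placeholder extractor (mean and peak amplitude), not a mel cepstrum."""
    return [sum(waveform) / len(waveform), max(abs(s) for s in waveform)]


class LabelImageServer:
    """Hypothetical server that performs feature extraction on its own side."""

    def __init__(self, extract_features: FeatureExtractor, entries: List[Entry]):
        self.extract_features = extract_features
        self.entries = entries

    def search(self, label: str, waveform: Sequence[float]) -> Optional[str]:
        # Step (2): the server, not the terminal, computes the features.
        query = self.extract_features(waveform)
        # Step (3): narrow by label string, then choose the nearest entry.
        candidates = [e for e in self.entries if e[0] == label] or self.entries
        if not candidates:
            return None
        best = min(
            candidates,
            key=lambda e: math.sqrt(sum((x - y) ** 2 for x, y in zip(e[2], query))),
        )
        return best[1]
```

The terminal only sends the label string and the waveform, so it needs no feature extraction code in this arrangement.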

In the first to fifth embodiments described above, no particular mention was made of the communication method between the label image registration terminal 100 and the label image server 200, or between the labeling work terminal 300 and the label image server 200; any method may be used.
For example, the system may be configured as a client-server system that sends and receives data and commands over an arbitrary TCP port, or as a Web application in which the label image server 200 provides Web server functions and the label image registration terminal 100 and the labeling work terminal 300 provide Web browser functions.
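
If the TCP client-server option is chosen, requests need some framing on the byte stream. The sketch below shows one possible framing, a 4-byte big-endian length prefix followed by a JSON body. The document does not specify any message format, so the format, the command names, and the function names are all assumptions made for illustration.

```python
import json
import struct
from typing import Any, Dict, Tuple


def encode_request(command: str, payload: Dict[str, Any]) -> bytes:
    """Frame a request as a 4-byte big-endian length followed by a JSON body."""
    body = json.dumps({"command": command, "payload": payload}).encode("utf-8")
    return struct.pack(">I", len(body)) + body


def decode_request(message: bytes) -> Tuple[str, Dict[str, Any]]:
    """Inverse of encode_request for one complete framed message."""
    (length,) = struct.unpack(">I", message[:4])
    data = json.loads(message[4:4 + length].decode("utf-8"))
    return data["command"], data["payload"]
```

A search request from the labeling work terminal could then be built as `encode_request("search", {"speaker": "A", "label": "k-a+i"})` and written to a socket.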

It was also explained that the label image DB 204a is stored in the storage means 204 and the default label image DB 205a in the default label image storage means 205; any suitable storage format may be used.
As one example, a DBMS (Database Management System) that stores its data files in the respective storage means can be set up on the label image server 200, table-format data structures such as those in FIGS. 3, 9, and 12 can be defined under the DBMS, and each row can store a data entry as described for those figures.
When the image data of a label image is large, the label image may be stored as an image file separately from the DB, with only its file path held in the "label image" column.
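
As one illustration of the DBMS-based storage described above, the sketch below uses SQLite to define a FIG. 9 style table whose "label image" column holds only a file path. The choice of SQLite, the table and column names, and the helper functions are assumptions; the document does not name a particular DBMS or schema.

```python
import sqlite3
from typing import Optional


def create_label_image_table(conn: sqlite3.Connection) -> None:
    """Create a table keyed by the combination of speaker name and label string."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS label_image (
            speaker_name     TEXT NOT NULL,
            label_string     TEXT NOT NULL,
            label_image_path TEXT NOT NULL,
            PRIMARY KEY (speaker_name, label_string)
        )
        """
    )


def register(conn: sqlite3.Connection, speaker: str, label: str, path: str) -> None:
    """Store the path of an image file that is kept outside the DB."""
    conn.execute(
        "INSERT INTO label_image VALUES (?, ?, ?)", (speaker, label, path)
    )


def find_path(conn: sqlite3.Connection, speaker: str, label: str) -> Optional[str]:
    """Look up the image path for one speaker and label string."""
    row = conn.execute(
        "SELECT label_image_path FROM label_image "
        "WHERE speaker_name = ? AND label_string = ?",
        (speaker, label),
    ).fetchone()
    return row[0] if row else None
```

Keeping only the path in the table keeps rows small, at the cost of having to manage the image files alongside the database.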

FIG. 1 shows the configuration of the voice labeling support system according to Embodiment 1.
FIG. 2 shows an example of a label image.
FIG. 3 shows the configuration of the label image DB 204a, together with example data.
FIG. 4 illustrates the processing sequence performed when the label image registration terminal 100 registers a label image.
FIG. 5 shows a configuration example of the labeling work screen displayed on the display unit 302 of the labeling work terminal 300.
FIG. 6 illustrates the internal processing sequence of the labeling work terminal 300 in Embodiment 1.
FIG. 7 shows the configuration of the voice labeling support system according to Embodiment 2.
FIG. 8 illustrates the internal processing sequence of the labeling work terminal 300 in Embodiment 2.
FIG. 9 shows the configuration of the label image DB 204a in Embodiment 3, together with example data.
FIG. 10 illustrates the processing sequence performed when the label image registration terminal 100 registers a label image.
FIG. 11 illustrates the internal processing sequence of the labeling work terminal 300 in Embodiment 3.
FIG. 12 shows the configuration of the label image DB 204a in Embodiment 4, together with example data.
FIG. 13 illustrates the processing sequence performed when the label image registration terminal 100 registers a label image.
FIG. 14 illustrates the internal processing sequence of the labeling work terminal 300 in Embodiment 4.
FIG. 15 illustrates the internal processing sequence of the labeling work terminal 300 in Embodiment 5.

Explanation of symbols

100 label image registration terminal, 101 label string transmission unit, 102 label image transmission unit, 200 label image server, 201 label registration determination unit, 202 label image registration unit, 203 label image search unit, 204 storage means, 204a label image DB, 205 default label image storage means, 205a default label image DB, 300 labeling work terminal, 301 label image search request unit, 302 display unit, 400 network.

Claims

A voice labeling support system comprising:
a label image server comprising storage means for storing a label image DB that holds a label image for each label string of a phonemic environment; and
a work terminal for performing labeling work,
wherein the label image server comprises
a search unit that receives a label string to be searched for and searches the label image DB for the corresponding label image,
the work terminal comprises
a search request unit that requests the search unit to search for a label image, and
a display unit that displays the label image on a screen,
the search request unit
transmits the label string of the voice to be labeled to the label image server,
the search unit
receives the label string from the search request unit, searches the label image DB for an entry corresponding to the label string, and returns the label image associated with the label string to the work terminal, and
the search request unit
receives the label image from the search unit and causes the display unit to display the label image on the screen.

The voice labeling support system according to claim 1, wherein
the storage means
stores a default label image DB generated from the utterances of a predetermined speaker, and
the search unit,
when no entry corresponding to the label string received from the search request unit exists in the label image DB,
searches the default label image DB for the label image corresponding to the label string closest to the label string received from the search request unit, and
returns that label image to the work terminal.
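The fallback search in this claim can be sketched as follows. The claim does not define what "closest" means for label strings, so this sketch assumes the edit distance (Levenshtein distance) between sequences of phoneme labels. The function names, the use of tuples as label strings, and the dictionaries standing in for the two DBs are likewise assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def search_label_image(label_string, label_image_db, default_label_image_db):
    """Return the exact match if present, else the closest default entry."""
    if label_string in label_image_db:
        return label_image_db[label_string]
    if not default_label_image_db:
        return None
    # Assumed closeness measure: smallest edit distance to the query.
    closest = min(default_label_image_db,
                  key=lambda key: edit_distance(label_string, key))
    return default_label_image_db[closest]
```

For example, if the label image DB holds only `("a", "k", "a")` and the default DB holds `("a", "k", "i")` and `("o", "s", "u")`, a query for `("a", "k", "u")` misses the label image DB and falls back to the default entry for `("a", "k", "i")`, which differs in one label.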

The voice labeling support system according to claim 1 or 2, wherein
the label image DB
holds label images generated from the utterances of a plurality of speakers,
the search request unit
transmits, together with the label string of the voice to be labeled, information identifying the speaker to the label image server, and
the search unit
receives the label string and the information identifying the speaker from the search request unit,
searches the label image DB for an entry corresponding to the label string and the speaker, and
returns the label image associated with the label string and the speaker to the work terminal.

The voice labeling support system, wherein
the label image DB
holds mel cepstrum information for each label image,
the search request unit
obtains the mel cepstrum information of the voice to be labeled and transmits the result to the label image server, and
the search unit
receives the mel cepstrum information from the search request unit,
calculates, for each piece of mel cepstrum information held in the label image DB, its distance from the mel cepstrum information transmitted by the search request unit, and
searches the label image DB for the mel cepstrum information with the shortest distance and returns the label image corresponding to that mel cepstrum information to the work terminal.
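The shortest-distance search over mel cepstrum information can be sketched as follows. The claim does not specify the distance measure or the data layout, so this sketch assumes Euclidean distance between fixed-length mel cepstrum vectors and represents each DB entry as a dictionary with the hypothetical keys `mel_cepstrum` and `label_image`.

```python
import math


def find_closest_label_image(query_mel_cepstrum, entries):
    """Return the label image whose mel cepstrum is nearest to the query.

    entries: list of dicts with hypothetical keys 'mel_cepstrum'
    (a sequence of floats) and 'label_image'.
    """
    if not entries:
        return None
    # Assumed distance measure: Euclidean distance between vectors
    # of equal length (math.dist requires Python 3.8 or later).
    best = min(
        entries,
        key=lambda e: math.dist(query_mel_cepstrum, e["mel_cepstrum"]),
    )
    return best["label_image"]
```

A real system would likely compare sequences of frames rather than single vectors, which would need a sequence-level distance. That extension is outside what the claim states and is not attempted here.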

The voice labeling support system, wherein
the label image DB
holds mel cepstrum information for each label image,
the search request unit
transmits the waveform data of the voice to be labeled to the label image server, and
the search unit
receives the waveform data of the voice from the search request unit and obtains mel cepstrum information from the waveform data,
calculates, for each piece of mel cepstrum information held in the label image DB, its distance from the mel cepstrum information obtained by the search unit, and
searches the label image DB for the mel cepstrum information with the shortest distance and returns the label image corresponding to that mel cepstrum information to the work terminal.