JP3534711B2

JP3534711B2 - Audio editing device and audio editing program

Info

Publication number: JP3534711B2
Application number: JP2001101223A
Authority: JP
Inventors: 治笠井
Original assignee: 株式会社コナミコンピュータエンタテインメント東京
Priority date: 2001-03-30
Filing date: 2001-03-30
Publication date: 2004-06-07
Anticipated expiration: 2021-03-30
Also published as: JP2002297187A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to an audio editing device and an audio editing program, and in particular to a technique for improving the efficiency of audio editing by displaying, in association with each voice portion contained in the audio waveform being edited, an estimate of whether that portion is recording target voice or non-recording target voice.

[0002]

[Prior Art] Music, conversation, and dialogue for animation and games are often recorded in a recording studio or the like, converted into digital data, and then edited and processed with an audio editing program. Such a program displays the waveform of the recorded audio on a computer display based on the digital audio data (waveform data). The editor specifies a waveform position (a point in time) or a waveform range (a time interval) on the screen with a pointing device such as a mouse, and can then play back the recorded audio from that position onward or within that range. While checking the content of the waveform in this way, the editor specifies an arbitrary waveform position or range with the mouse or the like and then selects an editing operation such as cut, copy, paste, or the addition of a sound effect, and thereby proceeds with editing the recorded audio.

[0003]

[Problems to Be Solved by the Invention] However, with the conventional audio editing program described above, the editor cannot tell whether a piece of audio is a recording target without specifying a position or range on the waveform and listening, each time, to the audio from that position or within that range. Recorded audio contains both audio that is meant to be recorded (recording target voice, i.e. voice to be edited) and other audio (non-recording target voice, i.e. voice not to be edited). If the only way to judge which of the two each voice portion in the waveform belongs to is to listen to it, editing is inefficient. Recording target voice is, for example, a singer's singing voice, remarks by meeting participants, or lines spoken by a voice actor. Non-recording target voice is, for example, a singer, meeting participant, or voice actor talking to themselves, or the voices of studio staff.

[0004] The present invention has been made in view of the above problem. Its object is to provide an audio editing device and an audio editing program that improve the efficiency of audio editing by presenting, for each voice portion contained in the audio waveform being edited, a display that helps the editor judge whether the portion is recording target voice or non-recording target voice.

[0005]

[Means for Solving the Problems] To solve the above problem, an audio editing device according to the present invention comprises: waveform display means for displaying a waveform of audio; voice portion determination means for determining the voice portions contained in the waveform; estimation means for estimating whether each voice portion contained in the waveform relates to recording target voice or to non-recording target voice; and estimation result display means for displaying the estimation result of the estimation means in association with each voice portion contained in the waveform.

[0006] According to the present invention, the waveform of the audio is displayed on the display means, and for each voice portion contained in the waveform an estimate of whether that portion relates to recording target voice or to non-recording target voice is displayed in association with it. The editor can then easily judge from the display which of the two each voice portion relates to, and editing efficiency improves. The estimation result may be displayed directly, stating which of the two each voice portion is estimated to be, or indirectly, for example through the way each voice portion or its associated display is rendered. The display may also be graded to show which of the two each voice portion is closer to (for example: close to recording target voice, close to non-recording target voice, or neither).

[0007] In one aspect of the present invention, the estimation means estimates whether each voice portion contained in the waveform relates to recording target voice or to non-recording target voice based on the peak values of that portion. A single peak value of the voice portion may be used, or several values, for example all of them. In this aspect, a value derived from the peak values is calculated, such as the sound pressure, the effective sound pressure, the mean of the absolute peak values, or the maximum peak value. It then becomes possible to estimate, for example, that a voice portion for which this value exceeds a predetermined threshold relates to recording target voice, and that one for which it is at or below the threshold relates to non-recording target voice.
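As an illustration only, the peak-value-based estimation described in this paragraph might be sketched in Python as follows. The function names, the choice of the mean absolute value as the default measure, and the threshold of 300 are assumptions made for the sketch and do not appear in the specification.

```python
import math


def peak_value_features(samples):
    """Return several peak-value-based measures for one voice portion.

    `samples` is a sequence of signed PCM sample values (hypothetical input).
    """
    if not samples:
        raise ValueError("voice portion has no samples")
    abs_values = [abs(s) for s in samples]
    return {
        "peak": max(abs_values),                         # maximum peak value
        "mean_abs": sum(abs_values) / len(abs_values),   # mean of absolute peak values
        "rms": math.sqrt(sum(s * s for s in samples) / len(samples)),  # effective value
    }


def is_recording_target(samples, feature="mean_abs", threshold=300.0):
    """Estimate 'recording target' when the chosen measure exceeds the threshold.

    A value at or below the threshold is treated as non-recording target voice.
    The default threshold is an arbitrary illustrative number.
    """
    return peak_value_features(samples)[feature] > threshold
```

Any one of the three measures can be selected through `feature`; which measure and threshold work well would depend on the recording conditions.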

[0008] In one aspect of the present invention, the estimation means estimates whether each voice portion contained in the waveform relates to recording target voice or to non-recording target voice based on frequency information of that portion. This makes it possible, for example, to estimate which of the two a voice portion relates to from the frequency characteristics of that portion. The frequency information is, for example, the fundamental frequency (f0) of the speaker, the number of zero crossings, or the frequency distribution.
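Of the kinds of frequency information listed above, the number of zero crossings is the simplest to compute. The following Python sketch shows one way it might be done; the function names and the decision to skip exact-zero samples are assumptions of this sketch, not details given in the specification.

```python
def zero_crossing_count(samples):
    """Count sign changes between consecutive non-zero samples."""
    count = 0
    previous_sign = 0
    for s in samples:
        sign = (s > 0) - (s < 0)   # +1, -1, or 0
        if sign == 0:
            continue               # assumption: exact zeros do not count as a crossing
        if previous_sign != 0 and sign != previous_sign:
            count += 1
        previous_sign = sign
    return count


def zero_crossing_rate(samples, sample_rate_hz):
    """Zero crossings per second for a voice portion sampled at sample_rate_hz."""
    duration_s = len(samples) / sample_rate_hz
    return zero_crossing_count(samples) / duration_s if duration_s > 0 else 0.0
```

Normalizing by duration, as `zero_crossing_rate` does, lets voice portions of different lengths be compared against a common threshold.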

[0009] In one aspect of the present invention, the device further comprises speech recognition means for applying speech recognition processing to the waveform, and the estimation means estimates whether each voice portion contained in the waveform relates to recording target voice or to non-recording target voice based on the likelihood of the recognition result obtained for that portion in the speech recognition processing. This makes it possible, for example, to estimate that a voice portion whose recognition likelihood exceeds a predetermined threshold relates to recording target voice, and that one whose likelihood is at or below the threshold relates to non-recording target voice.

[0010] An audio editing program according to the present invention causes a computer to execute: a step of displaying a waveform of audio; a step of determining the voice portions contained in the waveform; a step of estimating whether each voice portion contained in the waveform relates to recording target voice or to non-recording target voice; and a step of displaying the estimation result of the estimating step in association with each voice portion contained in the waveform.

[0011] According to the present invention, the waveform of the audio is displayed on the display means of the computer, and for each voice portion contained in the waveform an estimate of whether that portion relates to recording target voice or to non-recording target voice is displayed in association with it. The editor can therefore easily judge from the display which of the two each voice portion relates to, and editing efficiency improves.

[0012]

[Embodiments of the Invention] Preferred embodiments of the present invention are described in detail below with reference to the drawings.

[0013] FIG. 1 shows the configuration of a computer system that operates as an audio editing device according to one embodiment of the present invention. In the computer system 10 shown in the figure, a CPU (central processing unit) 14, an image processing unit 16, a monitor 18, a hard disk storage device 19, a RAM (random access memory) 20, a ROM (read-only memory) 22, and input/output interfaces 24 and 28 are connected by a bus 12 so that they can exchange data with one another. A media reading device 26 is connected to the input/output interface 24, and an input device 30 is connected to the input/output interface 28.

[0014] The CPU 14 executes the program supplied from the media reading device 26 and controls each part of the computer system 10. The image processing unit 16 generates image data under the control of the CPU 14, converts it into a video signal at a predetermined timing, and outputs it to the monitor 18. The monitor 18 is a display device such as a CRT or LCD. The hard disk storage device 19 is a storage device onto which data such as programs read by the media reading device 26 are installed and which is also used as working storage for the CPU 14. The RAM 20 is a storage device used as working memory for the CPU 14. The ROM 22 stores data such as the BIOS (Basic Input Output System).

[0015] The input/output interface 24 relays data exchanged between the CPU 14 and the media reading device 26, and the input/output interface 28 relays data exchanged between the CPU 14 and the input device 30. The bus 12 is used to exchange data and addresses between the parts of the system.

[0016] The media reading device 26 reads data such as programs from an information storage medium such as an FD (floppy (registered trademark) disk), an MO disk (magneto-optical disk), a CD (compact disc)-ROM, or a DVD (digital video disc). Although the program is supplied here from an information storage medium, a data communication device may instead be connected to the computer system 10 so that the program is supplied over a communication network such as the Internet.

[0017] The input device 30 includes, for example, a character input device such as a keyboard, a pointing device such as a mouse, and a microphone or the like for inputting the audio to be edited. Audio input from the microphone is digitized and stored in the hard disk storage device 19 as a waveform data file. Alternatively, the audio to be edited may be digitized in advance on another device and stored on an information storage medium as a waveform data file, which is then read by the media reading device 26 and stored in the hard disk storage device 19. The waveform data file may also be supplied to the computer system over a communication network such as the Internet and stored in the hard disk storage device 19.

[0018] When the audio editing program is supplied to the computer system 10 configured in this way from an information storage medium such as a CD-ROM or DVD and installed on the hard disk storage device 19, the computer system 10 functions as an audio editing device.

[0019] FIG. 2 shows an example of the audio editing (waveform editing) screen displayed on the monitor 18 when the audio editing program is started on the computer system 10. When the computer system 10 uses a GUI (Graphical User Interface), for example, the audio editing screen shown in the figure is displayed on the monitor 18 as a single window. As shown in the figure, the waveform 31 of the audio being edited is displayed in the upper part of the screen. The waveform 31 represents the recorded audio with time on the horizontal axis and amplitude (peak value) on the vertical axis, and contains three voice portions 38, 40, and 42.

[0020] Here, a voice portion is a portion of the waveform that corresponds to voice rather than noise, that is, a portion from which speech is recognized by speech recognition processing. The start position (timing) and end position (timing) of each of the voice portions 38, 40, and 42 may be indicated on the audio editing screen by markers or the like.

[0021] A character string representing the spoken content is recognized from each of the voice portions 38, 40, and 42 by known speech recognition processing, and all or part of that string is displayed in the recognition result display frames 32, 34, and 36. Of the border lines of each recognition result display frame 32, 34, 36, the left vertical line corresponds to the start position of the voice portion and the right vertical line to its end position. In this way, the recognition result display frames 32, 34, and 36 are displayed in association with the voice portions 38, 40, and 42, respectively, and the character string recognized from each voice portion is displayed inside the corresponding frame.

[0022] When a recognition result is displayed in the recognition result display frames 32, 34, and 36, its font size indicates the estimate of whether the voice portion relates to recording target voice or to non-recording target voice. That is, the recognition result is displayed in a large font when the portion is estimated to be recording target voice and in a small font when it is estimated to be non-recording target voice. When neither can be estimated, the recognition result is displayed in a medium font. From the font size of the recognition result displayed for each voice portion, the editor can see at a glance whether that portion relates to recording target voice or to non-recording target voice.

[0023] Recording target voice is, for example, a singer's singing voice, remarks by meeting participants, or lines spoken by a voice actor. Non-recording target voice is, for example, a singer, meeting participant, or voice actor talking to themselves, or the voices of studio staff.

[0024] Although omitted from the figure, the audio editing screen also shows scales indicating the amplitude and time of the waveform 31, as well as toolbars and menus from which the editor selects the various audio editing tools.

[0025] FIG. 3 is a flowchart showing the editing screen display routine, which is part of the audio editing program. Besides this routine, the audio editing program also includes various editing routines for editing the waveform 31 displayed by the routine according to the editor's instructions.

[0026] As shown in the figure, the editing screen display routine first acquires a waveform file name, a display range, and a scale (S101). The waveform file name is the name (including the path if necessary) of the waveform file containing the audio to be edited. It is acquired, for example, when the editor selects the file on a menu screen with an input device 30 such as a mouse, or types the file name on an input device 30 such as a keyboard. The display range is the part of the audio being edited whose waveform 31 is shown on the editing screen. It is acquired, for example, by first displaying the waveform 31 on the monitor 18 and then letting the editor indicate the range by scrolling horizontally with an input device 30 such as a mouse or by resizing the GUI window. The scale is the display scale of the waveform 31. It is acquired, for example, from an enlargement or reduction instruction given through an input device 30 such as a keyboard.

[0027] Next, the waveform file with the name acquired in S101 is read from the hard disk storage device 19 (S102), and based on it the waveform 31 for the display range acquired in S101 is drawn in the RAM 20 or in the VRAM included in the image processing unit 16 (S103). Speech recognition processing is then applied to the waveform file read in S102, which identifies each voice portion contained in the waveform 31 and generates the character string recognized from each (S104). The number of phrases F in the display range, the phrase positions (Ps_i, Pe_i), and the phrase character counts y_i (i = 1 to F) are then stored in the RAM 20 (S105). Here, F is the number of voice portions in the display range, Ps_i is the start position (x coordinate, i.e. horizontal position) of the i-th voice portion on the audio editing screen, and Pe_i is the end position (x coordinate) of the i-th voice portion. y_i is the number of characters in the recognition result for the i-th voice portion. After that, Ps_i is subtracted from Pe_i to obtain the phrase length x_i (S106). The phrase length x_i represents the horizontal (time-axis) length, in pixels, of the i-th voice portion on the audio editing screen.
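The per-phrase values stored in S105 and S106 can be pictured as a small table. The Python sketch below is one possible representation, not the layout used by the patented program; the `Phrase` class, its field names, and `build_phrase_table` are hypothetical, and the recognized text is kept alongside the character count y_i purely for convenience.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    ps: int     # Ps_i: start x coordinate of the voice portion, in pixels
    pe: int     # Pe_i: end x coordinate of the voice portion, in pixels
    text: str   # recognized character string (hypothetical field)

    @property
    def y(self):
        """y_i: number of characters in the recognition result."""
        return len(self.text)

    @property
    def x(self):
        """x_i: phrase length in pixels, Pe_i minus Ps_i (S106)."""
        return self.pe - self.ps


def build_phrase_table(recognized):
    """Build the table from (Ps_i, Pe_i, text) tuples; its length is F."""
    return [Phrase(ps, pe, text) for ps, pe, text in recognized]
```

For example, `build_phrase_table([(10, 110, "hello")])[0].x` evaluates to 100, the phrase length in pixels.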

[0028] Next, the variable i that designates a voice portion is set to 1 (S107), and the font size z_i used to display the speech recognition result for the i-th voice portion is determined (S108). Here, font size means specifically the side length, in pixels, of the square display area needed to display one character. To determine z_i, the routine first estimates whether the i-th voice portion represents recording target voice or non-recording target voice, and then sets the font size according to that estimate. For example, a large font size is assigned to z_i when the portion is estimated to be recording target voice, and a small font size when it is estimated to be non-recording target voice. A medium font size is assigned when neither can be determined. The font size determination process is described in detail later with reference to FIGS. 5 to 7. As a result of the processing in S105, S106, and S108, the table shown in FIG. 5 is prepared in the RAM 20.

[0029] Next, it is determined whether the phrase length x_i is smaller than the font size z_i (S109). If x_i is smaller than z_i, the recognition result display frame does not fit in the display area directly below the voice portion, and the exceptional character drawing process is executed (S111). This process is shown in FIG. 9. If x_i is equal to or larger than z_i, the recognition result display frame can be drawn in the display area directly below the voice portion, and the normal character drawing process is executed (S110). This process is shown in FIG. 8. Both processes draw a recognition result display frame below the i-th voice portion and draw the recognized character string inside it.

[0030] After the normal character drawing process (S110) or the exceptional character drawing process (S111) has been executed, it is determined whether the variable i has reached the number of phrases F (S112). If it has not, 1 is added to i and the processing from S108 to S112 is executed again for the next voice portion. Once i has been incremented from 1 to F and the processing from S108 to S112 has been executed for each value of i, the audio editing screen drawn so far is displayed on the monitor 18 (S114). For example, if the audio editing screen was drawn in the RAM 20, the CPU 14 transfers it to the image processing unit 16, which outputs it to the monitor 18 at a predetermined timing so that the audio editing screen is displayed.
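The per-phrase loop of S107 to S112, including the choice between normal and exceptional drawing in S109, might be sketched in Python as below. The three callables are hypothetical stand-ins for the font size determination (S108) and the two drawing processes (S110, S111); the returned list of labels exists only so the sketch can be checked and is not part of the described routine.

```python
def draw_recognition_results(phrase_lengths, decide_font_size,
                             draw_normal, draw_exceptional):
    """Run the loop over phrases i = 1 .. F.

    phrase_lengths[i - 1] is the phrase length x_i in pixels.
    decide_font_size(i) returns z_i, the side length in pixels of one character.
    """
    chosen = []
    for i, x_i in enumerate(phrase_lengths, start=1):   # S107, S112, increment of i
        z_i = decide_font_size(i)                       # S108
        if x_i < z_i:                                   # S109
            draw_exceptional(i, z_i)                    # S111: frame does not fit below
            chosen.append("exceptional")
        else:
            draw_normal(i, z_i)                         # S110: frame fits below
            chosen.append("normal")
    return chosen
```

A phrase whose length exactly equals the font size takes the normal path, matching the "equal to or larger" condition in the text.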

[0031] In this way, the audio editing screen can display the waveform 31 together with a recognition result display frame and a recognition result below each voice portion.

[0032] FIG. 5 is a flowchart showing an example of the font size determination process (S108). In this example, whether each voice portion relates to recording target voice or to non-recording target voice is estimated from the sound pressure of that portion. The process first calculates the sound pressure P_i [dB] of the i-th voice portion (phrase i) using equation (1) below (S401). The waveform data is assumed here to be 16-bit PCM data, and X is the sum of the absolute peak values per unit time divided by the number of samples. 0x8000 denotes 8000H (hexadecimal). According to equation (1), -96 dB corresponds to silence and 0 dB to maximum volume.

[0033]

[Equation 1] $P_i = 20 \log_{10}\left(X / \mathrm{0x8000}\right) \quad \cdots (1)$
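Equation (1) can be evaluated directly from the samples of a voice portion. The Python sketch below is illustrative only: the function name is hypothetical, X is averaged over the whole voice portion rather than over a unit time, and returning negative infinity for an all-zero portion is an assumption made so that the logarithm is never applied to zero.

```python
import math

FULL_SCALE = 0x8000  # 32768, the reference magnitude for 16-bit PCM in equation (1)


def sound_pressure_db(samples):
    """Return P_i = 20 * log10(X / 0x8000) for one voice portion.

    X is the sum of the absolute sample values divided by the number of samples.
    """
    if not samples:
        raise ValueError("voice portion has no samples")
    x = sum(abs(s) for s in samples) / len(samples)
    if x == 0:
        return float("-inf")  # assumption: an all-zero portion has no finite level
    return 20.0 * math.log10(x / FULL_SCALE)
```

A portion whose mean absolute value is half of full scale comes out at about -6 dB, since 20 log10(0.5) is approximately -6.02.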

[0034] Once the sound pressure P_i has been calculated, it is determined whether P_i is greater than a predetermined threshold P1 (for example, -40 dB) (S402). If it is, the font size z_i is set to "large" (for example, 12 pt) (S404). If it is not, it is then determined whether P_i is greater than a predetermined threshold P2 (for example, -60 dB) (S403). If so, z_i is set to "medium" (for example, 10 pt) (S405). If P_i is at or below P2, z_i is set to "small" (for example, 8 pt) (S406).
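The two-threshold decision of S402 to S406 reduces to a short function. The Python sketch below uses the example thresholds and point sizes quoted in the text; the function and constant names are hypothetical.

```python
P1_DB = -40.0  # example threshold P1 from the text
P2_DB = -60.0  # example threshold P2 from the text


def font_size_for_sound_pressure(p_db, p1=P1_DB, p2=P2_DB):
    """Map a sound pressure P_i in dB to a font size in points."""
    if p_db > p1:        # S402
        return 12        # S404: "large", likely recording target voice
    if p_db > p2:        # S403
        return 10        # S405: "medium", undetermined
    return 8             # S406: "small", likely non-recording target voice
```

A value exactly equal to a threshold falls into the lower band, because both comparisons are strict.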

[0035] In this way, the font size z_i becomes larger as the sound pressure P_i of the voice portion becomes higher. By checking the font size of the characters displayed in the recognition result display frames 32, 34, 36 and so on, the editor can use it to help judge whether the corresponding voice portion relates to recording target voice or to non-recording target voice.

[0036] In general, recording target voice is recorded relatively clearly and its sound pressure P_i can be expected to be high, whereas the sound pressure P_i of non-recording target voice can be expected to be low. It is therefore reasonable to estimate from the sound pressure P_i of each voice portion whether that portion relates to recording target voice.

【００３７】The sound pressure is calculated by expression (1) above and is information based on the peak values of each voice portion. It is equally reasonable to judge whether a voice portion is recording target voice on the basis of other information derived from the peak values of that portion (for example, the maximum peak value, the effective sound pressure, or the sum of the absolute peak values). In any case, whether each voice portion is recording target voice can suitably be estimated from one or more peak values of that portion.
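The alternative peak-based quantities mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the samples are taken to be the amplitude values of one voice portion, and "effective sound pressure" is interpreted here as the root mean square of those samples.

```python
import math


def peak_based_features(samples):
    """Return three peak-derived quantities for one voice portion.

    Hypothetical helper; `samples` is assumed to be a non-empty list
    of amplitude values.
    """
    abs_values = [abs(s) for s in samples]
    return {
        "max_peak": max(abs_values),                                    # maximum peak value
        "rms": math.sqrt(sum(s * s for s in samples) / len(samples)),  # assumed meaning of effective sound pressure
        "abs_sum": sum(abs_values),                                     # sum of absolute peak values
    }


features = peak_based_features([0.5, -1.0, 0.5])
print(features)
```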

【００３８】FIG. 6 is a flowchart showing another example of the font size determination process (S108). In this example, whether each voice portion is recording target voice or non-recording target voice is estimated from the frequency distribution of that portion. In this process, the frequency distribution of the i-th voice portion (phrase i) is first calculated (S501). The frequency distribution is information indicating how much of a component each voice portion contains in each frequency band.

【００３９】Once the frequency distribution has been calculated, it is checked whether it is a dialogue-type frequency distribution (S502). As the dialogue-type frequency distribution, for example, the average frequency distribution obtained when many voice actors read various scripts can be used. In S502, for example, the degree of coincidence between the frequency distribution calculated in S501 and the dialogue-type frequency distribution prepared in advance is calculated, and whether the distribution is dialogue-type is judged by whether that degree exceeds a predetermined threshold. If the frequency distribution calculated in S501 is dialogue-type, "large (for example, 12 pt)" is set as the font size z_i; otherwise, "small (for example, 8 pt)" is set as the font size z_i.
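The comparison in S502 can be sketched as follows. The description does not say how the degree of coincidence is computed, so cosine similarity is used here purely as an assumption; the function names, the per-band energy vectors, and the threshold of 0.9 are all hypothetical.

```python
import math


def coincidence(distribution, template):
    """Cosine similarity between two per-band energy vectors (an assumed measure)."""
    dot = sum(a * b for a, b in zip(distribution, template))
    norm = math.sqrt(sum(a * a for a in distribution)) * math.sqrt(sum(b * b for b in template))
    return dot / norm if norm else 0.0


def font_size_from_distribution(distribution, template, threshold=0.9):
    """Return 12 pt for a dialogue-type distribution, otherwise 8 pt."""
    return 12 if coincidence(distribution, template) > threshold else 8


dialogue_template = [0.1, 0.6, 0.3]   # hypothetical average over many voice actors
print(font_size_from_distribution([0.1, 0.6, 0.3], dialogue_template))
print(font_size_from_distribution([0.9, 0.05, 0.05], dialogue_template))
```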

【００４０】In this way, the font size z_i can be made larger when the frequency distribution of a voice portion is close to the dialogue-type frequency distribution. The editor can check the font size of the characters displayed in the recognition result display frames 32, 34, 36, and so on, and use it to help judge whether the corresponding voice portion is recording target voice (here, dialogue) or non-recording target voice (here, anything other than dialogue).

【００４１】Although it is judged here whether the frequency distribution of each voice portion is dialogue-type or non-dialogue-type, it may instead be judged whether it is singing-type or non-singing-type, or conversation-type or non-conversation-type. In any case, by preparing in advance an average frequency distribution of the recording target voice (dialogue, song, conversation, and so on) and checking whether the frequency characteristics of each voice portion are close to it, it can be estimated whether each voice portion is recording target voice. For simplicity, an example using only the frequency distribution has been shown here, but using dynamic feature parameters of the frequency distribution (how it changes) is also very effective.

【００４２】In general, recording target voice is usually uttered directly toward the microphone, whereas non-recording target voice is often uttered far from the microphone or in a direction away from it. The frequency distributions of the two therefore differ, and it is reasonable to estimate from the frequency distribution of each voice portion whether that portion is recording target voice. Similarly, it is reasonable to make this estimate from other frequency information of each voice portion, such as the fundamental frequency (f0) or the number of zero crossings.
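Of the other frequency information mentioned above, the number of zero crossings is the simplest to compute. The following is a minimal sketch, assuming the voice portion is available as a list of signed amplitude samples; the function name is hypothetical, and samples that are exactly zero are not counted as crossings.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive samples of one voice portion."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)


print(zero_crossings([1, -1, 1, -1]))
print(zero_crossings([1, 2, 3]))
```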

【００４３】Alternatively, an average frequency distribution obtained when male voice actors read various scripts and one obtained when female voice actors read various scripts may be prepared in advance. By checking which of the two the frequency distribution calculated in S501 is closer to, it can be estimated whether each voice portion is a script read by a male voice actor or by a female voice actor, and the font size or font color may be changed accordingly. The editor can then look at the font size or font color to help judge whether each voice portion was spoken by a female or a male voice actor.
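Choosing the closer of two prepared distributions can be sketched as follows. The description does not define "closer", so squared Euclidean distance is an assumption, and the function name, template values, and labels are hypothetical.

```python
def closer_template(distribution, templates):
    """Return the label of the template nearest to `distribution`.

    `templates` maps a label to a per-band energy vector of the same length.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(templates, key=lambda label: squared_distance(distribution, templates[label]))


templates = {
    "male": [0.6, 0.3, 0.1],     # hypothetical average for male voice actors
    "female": [0.2, 0.4, 0.4],   # hypothetical average for female voice actors
}
print(closer_template([0.5, 0.3, 0.2], templates))
```

The returned label could then select a font size or font color for the corresponding recognition result.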

【００４４】FIG. 7 is a flowchart showing yet another example of the font size determination process (S108). In this example, whether each voice portion is recording target voice or non-recording target voice is estimated from the likelihood (plausibility) of the recognition result obtained in the voice recognition process for that portion. Specifically, when the voice recognition process of S104 (FIG. 3) generates the recognition result for each voice portion, it calculates the likelihood of many character strings as recognition results and adopts the character string with the highest likelihood as the actual recognition result. Here, whether a voice portion is recording target voice is estimated from the likelihood calculated in S104 for the recognition result of that portion.

【００４５】In this process, the likelihood α_i of the i-th voice portion (phrase i) is first acquired (S601). The likelihood α_i is assumed to have been stored in association with each voice portion during the processing of S104.

【００４６】Once the likelihood α_i has been acquired, it is determined whether α_i is larger than a predetermined threshold α1 (S602). If it is larger than α1, "large (for example, 12 pt)" is set as the font size z_i (S604). If it is less than α1, it is next determined whether α_i is larger than a predetermined threshold α2 (S603). If it is larger than α2, "medium (for example, 10 pt)" is set as the font size z_i (S605). If, on the other hand, α_i is less than or equal to α2, "small (for example, 8 pt)" is set as the font size z_i (S606).
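The likelihood-based variant can be sketched in two steps: picking the most likely character string, as S104 is described as doing, and mapping its likelihood to a font size (S602 to S606). The function names and the hypothesis list are hypothetical, and because the description gives no values for α1 and α2, the defaults 0.8 and 0.5 are assumptions chosen only for illustration.

```python
def best_hypothesis(hypotheses):
    """Pick the (text, likelihood) pair with the highest likelihood."""
    return max(hypotheses, key=lambda h: h[1])


def font_size_from_likelihood(alpha_i, alpha1=0.8, alpha2=0.5):
    """Map a likelihood to a font size in points; thresholds are assumed values."""
    if alpha_i > alpha1:   # S602 -> S604
        return 12
    if alpha_i > alpha2:   # S603 -> S605
        return 10
    return 8               # S606


text, alpha = best_hypothesis([("hello", 0.9), ("hollow", 0.4)])
print(text, font_size_from_likelihood(alpha))
```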

【００４７】In this way, the higher the likelihood α_i of a voice portion, the larger the font size z_i becomes. The editor can check the font size of the characters displayed in the recognition result display frames 32, 34, 36, and so on, and use it to help judge whether the corresponding voice portion is recording target voice or non-recording target voice.

【００４８】In general, recording target voice is recorded relatively clearly and the likelihood α_i of its recognition result is likely to be large, whereas the likelihood α_i for non-recording target voice is likely to be small. It is therefore reasonable to estimate from the likelihood α_i of the recognition result for each voice portion whether that portion is recording target voice.

【００４９】Next, FIG. 8 is a diagram showing the normal character drawing routine. As shown in the figure, the routine first determines the number of lines n that satisfies the following expression (2) (S201).

【００５０】[0050]

【数２】[Equation 2] x_i × (n − 1) < y_i × z_i ≤ x_i × n … (2)
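Expression (2) says that n is the smallest integer with y_i × z_i ≤ x_i × n, that is, y_i × z_i / x_i rounded up. The sketch below assumes, as the surrounding expressions suggest, that x_i is the frame width in dots, y_i the number of characters in the recognition result, and z_i the font size in dots; the function name is hypothetical.

```python
def line_count(x_i, y_i, z_i):
    """Smallest n with x_i*(n-1) < y_i*z_i <= x_i*n (integer ceiling division)."""
    return -(-(y_i * z_i) // x_i)


n = line_count(100, 25, 10)
print(n, 100 * (n - 1) < 25 * 10 <= 100 * n)
```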

【００５１】Next, it is determined whether the number of lines n is less than or equal to the maximum number of lines N (S202). If it is, a recognition result display frame z_i × n dots high and x_i dots wide is drawn below the i-th voice portion (phrase) (S203). The frame is drawn so that the x coordinate (horizontal coordinate) of its upper left corner coincides with the phrase (start) position Ps_i of the i-th voice portion. The recognition result obtained in S104 for the i-th voice portion is then drawn inside the frame at font size z_i (S204).

【００５２】If, on the other hand, the number of lines n is greater than the maximum number of lines N, a recognition result display frame z_i × N dots high and x_i dots wide is drawn below the i-th voice portion (S205). The frame is again drawn so that the x coordinate (horizontal coordinate) of its upper left corner coincides with the phrase (start) position Ps_i of the i-th voice portion. Then, among the characters of the recognition result obtained in S104 for the i-th voice portion, the characters to be omitted from the voice editing screen are determined (S206). Specifically, the number of omitted characters a_i given by the following expression (3) is calculated, and a_i consecutive characters, excluding the first and last characters, are selected from the recognition result as omitted characters. Here, int() is a function that converts the value in parentheses to an integer. The second term represents the number of characters that can be displayed in the recognition result display frame, and the third term reserves one character position for a symbol (for example, "…") indicating that characters have been omitted.

【００５３】[0053]

【数３】[Equation 3]

【００５４】a_i = y_i − int(x_i / z_i) × N + 1 … (3)

【００５５】The characters not selected as omitted characters are then drawn at font size z_i inside the recognition result display frame drawn in S205 (S207). It is preferable to display a symbol indicating that characters have been omitted, such as "…" or "〜", at the position where the omitted characters originally were.
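The omission of S206 and S207 can be sketched as follows. Expression (3) fixes how many characters are dropped, but the description does not say which run of consecutive characters is chosen, so taking the run from the middle of the string is an assumption. The function name is hypothetical, and x_i, z_i, and N are read as frame width in dots, font size in dots, and maximum number of lines.

```python
def abbreviate(text, x_i, z_i, max_lines, mark="…"):
    """Drop a_i consecutive characters, keeping the first and last, and insert a mark."""
    capacity = (x_i // z_i) * max_lines   # second term of expression (3)
    if len(text) <= capacity:
        return text                       # everything fits, nothing to omit
    a_i = len(text) - capacity + 1        # expression (3): +1 makes room for the mark
    a_i = min(a_i, len(text) - 2)         # never drop the first or last character
    if a_i <= 0:
        return text
    start = 1 + (len(text) - 2 - a_i) // 2   # assumed: omit from the middle
    return text[:start] + mark + text[start + a_i:]


print(abbreviate("abcdefghij", 40, 10, 2))
```

With a frame 40 dots wide, a 10-dot font, and two lines, eight positions are available, so three of the ten characters are replaced by one mark.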

【００５６】In this way, the recognition result display frame is displayed directly below the voice portion, and part or all of the character string that is the recognition result is displayed inside it. Even when part of the recognition result is omitted from the voice editing screen, the first and last characters are never omitted, so the editor can easily judge the content of the voice portion. Furthermore, the font size of the character string displayed in each recognition result display frame is determined according to the estimate of whether the corresponding voice portion is recording target voice, so the editor can look at the font size to help judge whether each voice portion is recording target voice.

【００５７】Next, FIG. 9 is a diagram showing the exceptional character drawing routine. As shown in the figure, the routine first determines whether the following expression (4) is satisfied (S301).

【００５８】[0058]

【数４】[Equation 4] Ps_i + z_i > Ps_{i+1} … (4)

【００５９】If expression (4) is satisfied, characters drawn at font size z_i below the voice portion would extend to directly below the voice portion on its right, making it impossible to draw the recognition result for that portion there. The exceptional character drawing routine and its parent process, the edit screen display routine, are therefore interrupted, and a message is displayed prompting the editor to change, for example, the font size z_i, the display range, or the scale.
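The check of expression (4) can be sketched over a whole list of voice portions as follows. The function name and the list-based representation are hypothetical; `starts[i]` stands for the phrase start position Ps_i in dots and `font_sizes[i]` for z_i in dots.

```python
def first_overlap(starts, font_sizes):
    """Return the index i of the first portion with Ps_i + z_i > Ps_{i+1}, or None."""
    for i in range(len(starts) - 1):
        if starts[i] + font_sizes[i] > starts[i + 1]:   # expression (4)
            return i
    return None


print(first_overlap([0, 50, 55, 200], [12, 12, 12, 12]))
print(first_overlap([0, 50, 100], [12, 12, 12]))
```

A non-None result corresponds to the case in which drawing is interrupted and the editor is asked to change the font size, display range, or scale.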

【００６０】If, on the other hand, expression (4) is not satisfied, it is next determined whether the number of lines n is less than or equal to the maximum number of lines N (S302). If it is, a recognition result display frame z_i × n dots high and z_i dots wide is drawn below the i-th voice portion (phrase) (S303). The frame is drawn so that the x coordinate (horizontal coordinate) of its upper left corner coincides with the phrase (start) position Ps_i of the i-th voice portion. The recognition result obtained in S104 for the i-th voice portion is then drawn inside the frame at font size z_i (S304). In this case, the recognition result is displayed in vertical writing.

【００６１】If, on the other hand, the number of lines n is greater than the maximum number of lines N, a recognition result display frame z_i × N dots high and z_i dots wide is drawn below the i-th voice portion (S305). The frame is again drawn so that the x coordinate (horizontal coordinate) of its upper left corner coincides with the phrase (start) position Ps_i of the i-th voice portion. Then, among the characters of the recognition result obtained in S104 for the i-th voice portion, the characters to be omitted from the voice editing screen are determined (S306). Specifically, the number of omitted characters a_i given by expression (3) above is calculated, and a_i consecutive characters, excluding the first and last characters, are selected from the recognition result as omitted characters.

【００６２】The characters not selected as omitted characters are then drawn at font size z_i inside the recognition result display frame drawn in S305 (S307). It is preferable to display a symbol indicating that characters have been omitted, such as "…" or "〜", at the position where the omitted characters originally were.

【００６３】In this way, the recognition result display frame is displayed below the voice portion, and part or all of the character string that is the recognition result is displayed inside it. If the right-hand (vertical) frame line of the recognition result display frame would intrude into the area where the recognition result display frame of the voice portion on its right should be displayed, the processing is interrupted.

【００６４】With the voice editing apparatus (voice editing program) described above, the waveform 31 is displayed on the voice editing screen, and the voice portions 38, 40, 42 are set apart from the other (non-voice) portions by the vertical frame lines of the recognition result display frames 32, 34, 36, so the editor can identify the voice portions contained in the waveform 31 at a glance. In addition, the recognition result display frames 32, 34, 36 are displayed in association with the voice portions 38, 40, 42, respectively, and all or part of the recognition result of each voice portion is displayed inside them, so the editor can tell at a glance what each voice portion in the waveform 31 contains. Furthermore, because the font size is determined according to the estimate of whether each voice portion is recording target voice, the editor can look at the font size to help judge whether the content is recording target voice, which greatly improves the efficiency of voice editing.

【００６５】The present invention is not limited to the embodiment described above.

【００６６】For example, the voice editing screen is not limited to the one shown in FIG. 2, and various layouts can be adopted. For example, as shown in FIG. 10, recognition result display frames 52, 54, 56 may be displayed below the waveform 64 and markers 66, 68, 70 above it, each in association with the voice portions 58, 60, 62, and the size of the markers 66, 68, 70 may be determined from the estimate of whether each voice portion 58, 60, 62 is recording target voice. Alternatively, the color of the voice portions 58, 60, 62 themselves, or their background color, may be changed according to the same estimate. In any case, displaying, in association with each voice portion, whether that portion is recording target voice greatly improves the efficiency of voice editing.

【００６７】[0067]

【発明の効果】[Effects of the Invention] As described above, according to the present invention, the waveform of a voice is displayed, the voice portions contained in the waveform are determined, whether each voice portion is recording target voice or non-recording target voice is estimated, and the estimation result is displayed in association with each voice portion. The editor can therefore easily judge from the display whether each voice portion is recording target voice or non-recording target voice, and editing efficiency is improved.

[Brief description of drawings]

FIG. 1 is a diagram showing the configuration of a computer system that functions as a voice editing apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing an example of the voice editing screen.

FIG. 3 is a flowchart illustrating the voice editing screen display routine.

FIG. 4 is a diagram showing a table generated in the voice editing screen display routine.

FIG. 5 is a flowchart illustrating an example of the font size determination routine.

FIG. 6 is a flowchart illustrating another example of the font size determination routine.

FIG. 7 is a flowchart illustrating yet another example of the font size determination routine.

FIG. 8 is a flowchart illustrating the normal character drawing routine.

FIG. 9 is a flowchart illustrating the exceptional character drawing routine.

FIG. 10 is a diagram showing a modification of the voice editing screen.

[Explanation of symbols]

10: computer system; 12: bus; 14: CPU; 16: image processing unit; 18: monitor; 19: hard disk storage device; 20: RAM; 22: ROM; 24, 28: input/output interface; 26: media reading device; 30: input device; 31, 64: waveform; 38, 40, 42, 58, 60, 62: voice portion; 32, 34, 36, 52, 54, 56: recognition result display frame; 66, 68, 70: (estimation result display) marker.

Continuation of front page
(51) Int. Cl.⁷, identification code, FI: G10L 17/00; G10L 3/00 551B
(56) References: JP-A-8-63186 (JP, A); JP-A-7-49695 (JP, A); JP-A-10-222187 (JP, A); JP-A-11-109988 (JP, A); JP-A-2002-297188 (JP, A)
(58) Fields investigated (Int. Cl.⁷, DB name): G10L 15/00 - 17/00; G06F 3/16

Claims

(57) [Claims]

1. A voice editing apparatus comprising: waveform display means for displaying a waveform of a voice; voice portion determination means for determining voice portions included in the waveform; estimating means for estimating whether each voice portion included in the waveform is recording target voice or non-recording target voice; voice recognition means for performing voice recognition processing on the waveform; font size determining means for determining the font size of the character string that is the recognition result of the voice recognition processing for each voice portion included in the waveform, to a first size if the estimating means estimates the voice portion to be recording target voice, and to a second size smaller than the first size if the voice portion is estimated to be non-recording target voice; and character string display means for displaying, in accordance with the font size determined by the font size determining means, all or part of the character string that is the recognition result of the voice recognition processing for each voice portion included in the waveform, in association with that voice portion.

2. The voice editing apparatus according to claim 1, wherein the estimating means estimates whether each voice portion included in the waveform is recording target voice or non-recording target voice on the basis of the peak values of that voice portion.

3. The voice editing apparatus according to claim 1, wherein the estimating means estimates whether each voice portion included in the waveform is recording target voice or non-recording target voice on the basis of frequency information of that voice portion.

4. The voice editing apparatus according to any one of claims 1 to 3, wherein the estimating means estimates whether each voice portion included in the waveform is recording target voice or non-recording target voice on the basis of the likelihood of the recognition result of the voice recognition processing for that voice portion.

5. The voice editing apparatus according to any one of claims 1 to 4, wherein, when the estimating means cannot determine whether a voice portion is recording target voice or non-recording target voice, the font size determining means determines the font size of the character string that is the recognition result of the voice recognition processing for that voice portion to a third size smaller than the first size and larger than the second size.

6. A voice editing program for causing a computer to execute: a step of displaying a waveform of a voice; a step of determining voice portions included in the waveform; a step of estimating whether each voice portion included in the waveform is recording target voice or non-recording target voice; a step of performing voice recognition processing on the waveform; a step of determining the font size of the character string that is the recognition result of the voice recognition processing step for each voice portion included in the waveform, to a first size if the estimating step estimates the voice portion to be recording target voice, and to a second size smaller than the first size if the voice portion is estimated to be non-recording target voice; and a step of displaying, in accordance with the determined font size, all or part of the character string that is the recognition result of the voice recognition processing step for each voice portion included in the waveform, in association with that voice portion.