JP2011186396A

JP2011186396A - Speech recording device, speech recording method and program

Info

Publication number: JP2011186396A
Application number: JP2010054553A
Authority: JP
Inventors: Takeshi Iwaki; 健岩木
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2010-03-11
Filing date: 2010-03-11
Publication date: 2011-09-22

Abstract

PROBLEM TO BE SOLVED: To provide a speech recording device, a speech recording method and a program capable of precisely associating speech with additional information for speech retrieval.

SOLUTION: The device includes: a recording sentence acquiring section 11 for acquiring, in sentence units, a recording sentence T expressing speech to be recorded; a recording sentence dividing section 12 for dividing the recording sentence into exhalation paragraph units by language analysis processing; a speech recording section 13 for recording speech corresponding to the recording sentence in exhalation paragraph units; a start/end detecting section 14 for detecting start/end points of the speech from the recording result of the speech; a silence period length calculation section 15 for calculating the length of the silence period between exhalation paragraphs included in the recording sentence; a data creating section 16 for creating, from the recording result of the speech, the detection result of the start/end points, and the calculation result of the silence period, speech data corresponding to the recording sentence and additional information used for retrieval of the speech; and a data storage section 17 for storing the speech data corresponding to the recording sentence and the additional information in sentence units.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a speech recording device, a speech recording method, and a program.

Speech utilization systems that use recorded speech, such as voice guidance systems and corpus-based speech synthesis systems, have become widespread. These systems store recorded speech in a database and, when speech is needed, generate the desired speech by retrieving appropriate speech sections, concatenating them, and applying signal processing.

For this reason, speech is stored in association with additional information for speech retrieval. In general, the additional information uses linguistic information and acoustic information representing the speech, or identification information defined on the basis of these, in order to define speech sections. The additional information is, for example, a text transcription of the speech, symbols representing the reading (yomi) and accent positions of the speech, identification information of the speech, and section information of the phonemes in the speech. Such additional information is described in, for example, Patent Document 1 below. As a result, the utility value of a speech utilization system is evaluated from three viewpoints: the quality of the recorded speech, the quality of the additional information, and the cost of constructing the database.

Patent Document 1: JP 2001-34283 A

The additional information is used for various purposes, such as defining speech sections, statistical analysis of acoustic parameters, and construction of acoustic models, and it must meet the quality required by each purpose.

As an example of ensuring the quality of the additional information, in the phoneme labeling process used for constructing speech corpora and the like, the additional information is corrected manually. In addition, in speech analysis of utterances with irregular pauses, intonation, or speed, utterances with hesitations or restatements, and in particular free utterances made without a recording engineer, it is often impossible to ensure the accuracy with which speech is associated with additional information at the sentence level or the phrase (exhalation paragraph) level. These problems raise the cost of database construction and lower quality, and in turn hinder the spread of speech utilization systems.

To solve these problems, one could simply shorten the unit of speech to be recorded and record in exhalation paragraph units or word units to improve the accuracy of the additional information. In general, however, constructing a speech utilization system that accepts sentence-level input requires prosodic information to be generated in sentence units, so the recorded speech must also be collected and stored in a database in sentence units.

Accordingly, the present invention aims to provide a speech recording device, a speech recording method, and a program capable of associating speech with additional information for speech retrieval with high accuracy.

According to one aspect of the present invention, there is provided a speech recording device including: a recorded sentence acquisition unit that acquires, in sentence units, a recorded sentence representing speech to be recorded; a recorded sentence division unit that divides the recorded sentence into exhalation paragraph units by language analysis processing; a speech recording unit that records speech corresponding to the recorded sentence in exhalation paragraph units; a start/end detection unit that detects the start and end points of the speech from the recording result of the speech; a silent section length calculation unit that calculates the length of the silent section between exhalation paragraphs included in the recorded sentence; a data generation unit that generates, from the recording result of the speech, the detection result of the start and end points, and the calculation result of the silent section, speech data corresponding to the recorded sentence and additional information used for retrieving the speech; and a data storage unit that stores the speech data corresponding to the recorded sentence and the additional information in sentence units.

With this configuration, speech corresponding to the recorded sentence is recorded in exhalation paragraph units, the start and end points of the speech are detected, and the length of the silent section between exhalation paragraphs is calculated. Speech data and additional information corresponding to the recorded sentence are then generated from the recording result of the speech, the detection result of the start and end points, and the calculation result of the silent section, and are stored in sentence units. This makes it possible to detect the start and end points of the speech accurately and to associate the speech with the additional information with high accuracy. As a result, a speech utilization system that is excellent in recorded speech quality, additional information quality, and database construction cost can be provided.

The data generation unit may generate, as the additional information, at least one of the start and end points of the uttered speech, the morphological analysis result of the recorded sentence, and the length of the silent section between an exhalation paragraph and the following exhalation paragraph.

The recorded sentence division unit may divide the recorded sentence into exhalation paragraph units by a language analysis method capable of predicting speech boundary information and pause positions.

The speech recording unit may record speech uttered in accordance with guidance indicating the exhalation paragraph units.

The silent section length calculation unit may calculate the length of the silent section between exhalation paragraphs based on the division result of the recorded sentence.

The silent section length calculation unit may calculate the length of the silent section between exhalation paragraphs based on the recording result of the speech corresponding to the recorded sentence.

According to another aspect of the present invention, there is provided a speech recording method including the steps of: acquiring, in sentence units, a recorded sentence representing speech to be recorded; dividing the recorded sentence into exhalation paragraph units by language analysis processing; recording speech corresponding to the recorded sentence in exhalation paragraph units; detecting the start and end points of the speech from the recording result of the speech; calculating the length of the silent section between exhalation paragraphs included in the recorded sentence; generating, from the recording result of the speech and the calculation result of the silent section, speech data corresponding to the recorded sentence and additional information used for retrieving the speech; and storing the speech data corresponding to the recorded sentence and the additional information in sentence units.

According to yet another aspect of the present invention, there is provided a program for causing a computer to execute the above speech recording method. The program may be provided on a computer-readable recording medium or via communication means.

As described above, the present invention can provide a speech recording device, a speech recording method, and a program capable of associating speech with additional information for speech retrieval with high accuracy.

FIG. 1 is a block diagram showing the main functional configuration of a speech recording device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation procedure of the speech recording device.
FIG. 3 is a diagram illustrating the result of dividing a recorded sentence into exhalation paragraph units.
FIG. 4 is a diagram illustrating the guidance presented to the speaker when uttered speech is recorded.
FIG. 5 is a diagram illustrating the generated speech data and additional information.
FIG. 6 is a diagram illustrating the result of acoustic analysis processing.
FIG. 7 is a diagram illustrating the result of speech extraction processing.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.

FIG. 7 shows the result of speech extraction processing. In the speech extraction processing, the speech is analyzed by a program based on the recorded speech and the recorded sentence T, speech sections are estimated from the acoustic information of the phoneme sequence, and the speech sections are associated with the linguistic information on the time axis.

FIG. 7 shows the case where the speech section Sva corresponding to "六月八日" (June 8th) is cut out from the speech section Sv corresponding to the continuous utterance "六月八日、水曜日" (June 8th, Wednesday). The continuous utterance is recorded as shown in FIG. 7(a). In FIG. 7(b), the entire speech section Sv (including the speech sections Sva and Svb) has been cut out, so the extraction has failed. In FIG. 7(c), by contrast, only the speech section Sva has been cut out, so the extraction has succeeded.

Possible causes of failure include, for example, erroneous detection of the start and end points of the uttered speech due to loud stationary noise or impulsive burst noise, fitting with an inappropriate acoustic model, and divergence between the uttered speech and the recorded sentence T due to hesitations or restatements. In particular, when the start and end points of the uttered speech cannot be detected accurately, the speech section cannot be cut out properly. With current technology, however, it is considered difficult to cut out uttered speech automatically and with high accuracy while taking all of these causes into account.

[1. Configuration of speech recording device 10]
FIG. 1 shows the functional configuration of a speech recording device 10 according to an embodiment of the present invention. As shown in FIG. 1, the speech recording device 10 includes a recorded sentence acquisition unit 11, a recorded sentence division unit 12, a speech recording unit 13, a start/end detection unit 14, a silent section length calculation unit 15, a data generation unit 16, a data storage unit 17, a database 18, and a control unit 19.

The recorded sentence acquisition unit 11 acquires, in sentence units, a recorded sentence T representing the uttered speech to be recorded. The recorded sentence T is acquired as text data or the like and supplied to the recorded sentence division unit 12. The recorded sentence division unit 12 divides the recorded sentence T into exhalation paragraph units by language analysis processing. The divided recorded sentence T is supplied as text data or the like to the speech recording unit 13 and the silent section length calculation unit 15.

Here, an exhalation paragraph means a prosodic phrase, and an exhalation paragraph break means a prosodic phrase boundary. When a sentence is uttered, the temporal pattern of the fundamental frequency is expressed by local rises and falls (accent phrases), each formed by a group of roughly one or two phrasal units (bunsetsu), and by a global trend that declines gradually over time from the start of the utterance. At most boundaries between adjacent accent phrases, the global declining characteristic is maintained, and the accent phrases form a prosodic phrase, that is, a single prosodic unit. At some boundaries between adjacent accent phrases, however, the fundamental frequency is reset rather than continuing to decline, which produces a prosodic phrase boundary.

The speech recording unit 13 records, through a microphone 13a or the like, the uttered speech corresponding to the recorded sentence T in exhalation paragraph units. The recording result of the uttered speech is supplied to the start/end detection unit 14. The start/end detection unit 14 detects the start and end points of the uttered speech for each exhalation paragraph. The detection result of the start and end points is supplied to the data generation unit 16.

The silent section length calculation unit 15 calculates the length of the silent section Ss between exhalation paragraphs included in the recorded sentence T. The calculated length of the silent section Ss is supplied to the data generation unit 16. The silent section Ss, also called a pause or short pause, is the gap produced by a conscious or unconscious breath during speech.

The data generation unit 16 generates speech data and additional information corresponding to the recorded sentence T from the recording result of the speech, the detection result of the start and end points, and the calculation result of the silent section Ss. The generated speech data and additional information are supplied to the data storage unit 17 in sentence units.

Specifically, the speech data is generated by producing a combination of a speech waveform and a silent waveform for each exhalation paragraph and then concatenating the combinations for all exhalation paragraphs included in the recorded sentence T. The additional information is, for example, the start and end points of the uttered speech, the morphological analysis result of the recorded sentence T, and the length of the silent section Ss between an exhalation paragraph and the following exhalation paragraph.

The data storage unit 17 stores the speech data and the additional information in the database 18 in sentence units. The control unit 19 performs the arithmetic processing and control processing necessary to operate the speech recording device 10.

At least part of the above configuration may be implemented by software (a program) running on the speech recording device 10 or by hardware. When implemented by software, the program may be stored in advance on the speech recording device 10 or supplied from outside.

[2. Operation of speech recording device 10]
FIG. 2 shows the operation procedure of the speech recording device 10. The speech recording device 10 records uttered speech and stores speech data in sentence units, each corresponding to a recorded sentence T.

As shown in FIG. 2, the recorded sentence acquisition unit 11 acquires, in sentence units, a recorded sentence T representing the uttered speech to be recorded (step S11). The recorded sentence T may be fetched sentence by sentence from a recorded sentence set stored in a storage device (not shown), received through a network (not shown), or entered through an input device (not shown) such as a keyboard.

The recorded sentence division unit 12 divides the recorded sentence T into exhalation paragraph units by language analysis processing (S12). The language analysis processing uses an analysis method capable of predicting speech boundary information and pause positions. For example, the method described in the following document may be used.
"Prediction of prosodic phrase boundaries and pause positions using probabilistic context-free grammar", Fujio et al., IEICE Transactions D-II, Vol. J80-D-II, No. 1, pp. 18-25, January 1997

In general, among the boundaries between adjacent accent phrases, a boundary accompanied by a pause length exceeding a predetermined threshold can be regarded as an exhalation paragraph break. The recorded sentence T is therefore analyzed, the pause lengths between accent phrases are predicted, and each boundary whose pause length exceeds the predetermined threshold is treated as an exhalation paragraph break, so that the recorded sentence T is divided into exhalation paragraph units.

FIG. 3 shows the result of dividing a recorded sentence T into exhalation paragraph units. In FIG. 3, the recorded sentence T "あらゆる現実をすべて自分の方へねじ曲げたのだ。" (roughly, "He twisted every reality entirely toward himself.") is divided into first to third exhalation paragraphs.

In FIG. 3, the recorded sentence T is first divided, based on its reading (yomi) and accent positions, into the accent phrases "あらゆる", "現実を", "すべて", "自分の", "方へ", and "ねじ曲げたのだ". Next, based on the pause lengths at the accent phrase boundaries, the recorded sentence T is divided into the first to third exhalation paragraphs "あらゆる現実を", "すべて自分の方へ", and "ねじ曲げたのだ". In FIG. 3, for example, a pause length (BreathParagraphPause) of 0.35 seconds is predicted between the first and second exhalation paragraphs, and a pause length of 0.15 seconds is predicted between the second and third exhalation paragraphs.
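The threshold-based division of step S12 might be sketched as follows. This is only an illustration, not the method of the cited document: the function name `split_into_exhalation_paragraphs`, the threshold of 0.1 seconds, and the zero pause lengths assigned to the boundaries that FIG. 3 does not quantify are all assumptions made for the example. Only the 0.35-second and 0.15-second pauses come from the text.

```python
from typing import List


def split_into_exhalation_paragraphs(
    accent_phrases: List[str], pauses: List[float], threshold: float
) -> List[str]:
    """Merge accent phrases into exhalation paragraphs.

    pauses[i] is the predicted pause (seconds) between accent_phrases[i]
    and accent_phrases[i + 1]. A boundary whose pause exceeds `threshold`
    is treated as an exhalation paragraph break.
    """
    if not accent_phrases:
        return []
    if len(pauses) != len(accent_phrases) - 1:
        raise ValueError("need exactly one pause per accent phrase boundary")
    groups = [[accent_phrases[0]]]
    for phrase, pause in zip(accent_phrases[1:], pauses):
        if pause > threshold:
            groups.append([phrase])    # break: start a new paragraph
        else:
            groups[-1].append(phrase)  # no break: extend the current one
    return ["".join(group) for group in groups]


# Accent phrases from FIG. 3. The 0.0 pauses and the 0.1 s threshold
# are hypothetical values chosen for this sketch.
phrases = ["あらゆる", "現実を", "すべて", "自分の", "方へ", "ねじ曲げたのだ"]
pauses = [0.0, 0.35, 0.0, 0.0, 0.15]
paragraphs = split_into_exhalation_paragraphs(phrases, pauses, threshold=0.1)
print(paragraphs)
```

With these inputs the six accent phrases collapse into the three exhalation paragraphs described for FIG. 3.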

The speech recording unit 13 records, through the microphone 13a or the like, the uttered speech corresponding to the recorded sentence T in exhalation paragraph units (S13). While recording, the speech recording unit 13 guides the utterance of the speaker S visually or audibly.

FIG. 4 shows the guidance G presented to the speaker S when the uttered speech is recorded. In FIG. 4, the recorded sentence T is divided into the first to third exhalation paragraphs and shown together with its reading (yomi). FIG. 4(a) shows guidance G1 prompting the speaker to utter the speech section corresponding to the first exhalation paragraph, and FIG. 4(b) shows guidance G2 prompting the speaker to utter the speech section corresponding to the second exhalation paragraph.

As shown in FIG. 4, the speaker S is instructed to record each exhalation paragraph as one take and to utter the speech section corresponding to the exhalation paragraph in a single breath. As a result, exhalation paragraph breaks can be detected with high accuracy in the recorded speech, and the subsequent acoustic analysis processing yields highly accurate results.

The start/end detection unit 14 detects the start and end points of the uttered speech for each exhalation paragraph (S14). The start point is detected as the time at which, scanning forward from the start of the utterance, the power time series of the uttered speech first exceeds a predetermined threshold. The end point is detected as the time at which, scanning backward from the end of the utterance, the power time series first exceeds the threshold.
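The power-threshold detection of step S14 could look roughly like the sketch below. The frame length, the threshold value, the use of mean squared amplitude as "power", and the function names `frame_power` and `detect_endpoints` are assumptions for illustration; the source only states that a power time series is compared with a predetermined threshold from both ends.

```python
from typing import List, Optional, Tuple


def frame_power(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(
    power: List[float], threshold: float
) -> Optional[Tuple[int, int]]:
    """Return (start_frame, end_frame), or None if nothing exceeds threshold.

    The start is the first frame above the threshold scanning forward;
    the end is the first frame above the threshold scanning backward.
    """
    start = next((i for i, p in enumerate(power) if p > threshold), None)
    if start is None:
        return None
    end = next(i for i in range(len(power) - 1, -1, -1) if power[i] > threshold)
    return start, end


# Synthetic take: 40 silent samples, 60 "speech" samples, 20 silent samples.
samples = [0.0] * 40 + [0.5, -0.5] * 30 + [0.0] * 20
endpoints = detect_endpoints(frame_power(samples, frame_len=10), threshold=0.01)
print(endpoints)
```

Frame indices can be converted to times by multiplying by the frame length and dividing by the sampling rate. Because each take holds a single exhalation paragraph, a simple two-sided scan of this kind is less likely to be confused by pauses inside the utterance.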

Alternatively, the start and end points may be detected by extracting multiple candidate start and end points from the power time series. In this case, acoustic analysis processing such as phoneme labeling is applied to the uttered speech, the sequence of acoustic features is evaluated, and the candidates that give the best match are adopted as the start and end points.

The control unit 19 determines whether uttered speech has been recorded for all exhalation paragraphs included in the recorded sentence T (S15). If recording is complete, processing proceeds to the next step; if not, processing returns to step S13.

The silent section length calculation unit 15 calculates the length of the silent section Ss between exhalation paragraphs included in the recorded sentence T (S16). The length of the silent section Ss is calculated as a duration corresponding to the pause length between accent phrases calculated in step S12. Alternatively, the length of the silent section Ss may be calculated from actually recorded speech. In this case, the uttered speech corresponding to the recorded sentence T is actually recorded, and the silent section Ss is detected in the recorded speech to obtain its length.

The data generation unit 16 generates, in sentence units, speech data corresponding to the recorded sentence T from the recording result of the speech, the detection result of the start and end points, and the calculation result of the silent section Ss (S17). The data generation unit 16 appends a silent waveform whose duration equals the length of the silent section Ss to the end of the speech waveform of an exhalation paragraph, and appends the speech waveform of the next exhalation paragraph to the end of that silent waveform. The data storage unit 17 processes all exhalation paragraphs included in the recorded sentence T in the same manner to generate the speech data corresponding to the recorded sentence T.
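The concatenation in step S17 can be illustrated with a short sketch. It assumes each take has already been trimmed to its detected start and end points and is held as a list of samples; the function name `build_sentence_waveform` and the toy sampling rate of 100 Hz are hypothetical and chosen only to keep the numbers small.

```python
from typing import List


def build_sentence_waveform(
    segments: List[List[float]], silences_sec: List[float], sample_rate: int
) -> List[float]:
    """Join per-paragraph waveforms with zero-valued silent waveforms.

    silences_sec[i] is the silent section length (seconds) inserted
    between segments[i] and segments[i + 1].
    """
    if not segments:
        return []
    if len(silences_sec) != len(segments) - 1:
        raise ValueError("need exactly one silence per paragraph boundary")
    out = list(segments[0])
    for segment, silence in zip(segments[1:], silences_sec):
        out.extend([0.0] * round(silence * sample_rate))  # silent waveform
        out.extend(segment)                               # next paragraph
    return out


# Three trimmed takes of 0.90 s, 1.00 s and 0.90 s at a toy rate of 100 Hz,
# joined with the 0.35 s and 0.15 s silent sections from the example.
rate = 100
takes = [[0.1] * 90, [0.2] * 100, [0.3] * 90]
sentence = build_sentence_waveform(takes, [0.35, 0.15], rate)
print(len(sentence) / rate)  # total duration in seconds
```

Using exact zeros for the silent waveform is a simplification; a real system might prefer low-level room noise so that the joins are less audible.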

The data generation unit 16 also generates the additional information for the speech data in sentence units (S17). For each exhalation paragraph included in the recorded sentence T, the data generation unit 16 generates data representing, for example, the start and end points of the uttered speech, the morphological analysis result of the recorded sentence T, and the length of the silent section Ss between that exhalation paragraph and the following one.

FIG. 5 shows the generated speech data and additional information for the recorded sentence T shown in FIG. 3.

The speech data consists of the speech waveforms of the speech sections Sv1 to Sv3, which correspond to the first to third exhalation paragraphs, and the silent sections Ss1 and Ss2, which form the boundaries between the speech sections Sv1 to Sv3. Specifically, the 0.35-second silent section Ss1 is appended to the end of the first speech section Sv1, and the speech section Sv2 is appended to the end of the silent section Ss1. Likewise, the 0.15-second silent section Ss2 is appended to the end of the speech section Sv2, and the speech section Sv3 is appended to the end of the silent section Ss2.

The additional information consists of the start and end points of the speech sections Sv1 to Sv3 and the morphological analysis result of the recorded sentence T. Specifically, the start and end points are 0.00 s and 0.90 s for the speech section Sv1, 1.25 s and 2.25 s for the speech section Sv2, and 2.40 s and 3.30 s for the speech section Sv3. The readings (yomi) in the morphological analysis results for the speech sections Sv1 to Sv3 are "a/ra/yu/ru/ge/n/ji/tsu/o", "su/be/te/ji/bu/n/no/ho/o/e", and "ne/ji/ma/ge/ta/no/da", respectively.
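The start and end times in FIG. 5 follow directly from the segment durations and the silent section lengths, as the sketch below illustrates. The record layout (the dictionary keys) and the function name `build_additional_info` are assumptions for this example, and times are kept in integer milliseconds to avoid floating-point rounding; the durations 0.90 s, 1.00 s, and 0.90 s are simply the differences between the start and end points given above.

```python
from typing import Dict, List, Optional


def build_additional_info(
    durations_ms: List[int], silences_ms: List[int], yomi: List[str]
) -> List[Dict[str, Optional[object]]]:
    """Accumulate start/end times for each speech section of one sentence."""
    records = []
    t = 0
    for i, duration in enumerate(durations_ms):
        following = silences_ms[i] if i < len(silences_ms) else None
        records.append({
            "start_ms": t,
            "end_ms": t + duration,
            "yomi": yomi[i],
            "following_silence_ms": following,  # None after the last section
        })
        t += duration + (following or 0)
    return records


info = build_additional_info(
    durations_ms=[900, 1000, 900],
    silences_ms=[350, 150],
    yomi=[
        "a/ra/yu/ru/ge/n/ji/tsu/o",
        "su/be/te/ji/bu/n/no/ho/o/e",
        "ne/ji/ma/ge/ta/no/da",
    ],
)
for record in info:
    print(record["start_ms"], record["end_ms"], record["yomi"])
```

The resulting times are 0 to 900 ms, 1250 to 2250 ms, and 2400 to 3300 ms, matching the values stated for Sv1 to Sv3.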

The data storage unit 17 stores the speech data and the additional information in the database 18 in sentence units (S18). The control unit 19 determines whether uttered speech has been recorded for all recorded sentences T to be recorded (S19). If recording is complete, processing ends; if not, processing returns to step S11 and the next recorded sentence T is acquired.
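The overall control flow of steps S11 to S19 can be summarized as two nested loops, one over recorded sentences and one over exhalation paragraphs. In the sketch below every processing stage is passed in as a callable, so the function `record_corpus`, its parameter names, and the stand-in lambdas in the usage example are all hypothetical placeholders rather than the device's actual interfaces.

```python
from typing import Any, Callable, Iterable, List, Tuple


def record_corpus(
    sentences: Iterable[str],
    split: Callable[[str], List[str]],
    record_take: Callable[[str], Any],
    detect: Callable[[Any], Tuple[int, int]],
    pause_lengths: Callable[[List[str]], List[float]],
    build: Callable[[list, List[float]], Any],
    store: Callable[[str, Any], None],
) -> None:
    for sentence in sentences:                    # S11, repeated until S19
        paragraphs = split(sentence)              # S12: divide the sentence
        takes = []
        for paragraph in paragraphs:              # S13 to S15: one take each
            audio = record_take(paragraph)        # S13: record the take
            takes.append((audio, detect(audio)))  # S14: start/end points
        silences = pause_lengths(paragraphs)      # S16: silent section lengths
        store(sentence, build(takes, silences))   # S17 and S18


# Toy run with stand-in stages.
stored = []
record_corpus(
    sentences=["a b", "c"],
    split=str.split,
    record_take=lambda paragraph: [0.0, 1.0, 0.0],
    detect=lambda audio: (1, 1),
    pause_lengths=lambda ps: [0.2] * (len(ps) - 1),
    build=lambda takes, sil: {"n_takes": len(takes), "silences": sil},
    store=lambda sentence, data: stored.append((sentence, data)),
)
print(stored)
```

Recording happens per exhalation paragraph in the inner loop, while generation and storage happen once per sentence in the outer loop, which mirrors the sentence-unit storage described above.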

FIG. 6 shows the result of acoustic analysis processing such as phoneme labeling. Specifically, FIG. 6 shows the result of phoneme labeling applied to the uttered speech "六月八日" (June 8th). FIG. 6(a) shows the result obtained with the conventional technique, and FIG. 6(b) shows the result obtained with the speech recording device 10.

In FIG. 6(a), the speech section Sv for the utterance "六月八日", whose reading is "ロクガツヨーカ" (rokugatsu yōka), is detected as the speech section corresponding to "ロクガツヨー" (rokugatsu yō). In other words, because the start and end times of the utterance were detected incorrectly, the labeling does not match the reading of the utterance.

In FIG. 6(b), by contrast, the speech section Sv is detected as the speech section corresponding to "ロクガツヨーカ" (rokugatsu yōka). In other words, because the start and end times of the utterance were detected correctly, the labeling matches the reading of the utterance.

As described above, the audio recording device 10 records the uttered speech in units of exhalation paragraphs, which allows the start and end times of the uttered speech to be detected accurately. The uttered speech can therefore be associated with the additional information for speech retrieval with high accuracy.

In addition, the audio recording device 10 confines any degradation in sound quality caused by noise or incomplete utterances, and any erroneous detection of start and end times, to the exhalation paragraph in which it occurs. This improves the utilization efficiency of the recorded speech as a whole.

As explained above, in the audio recording device 10 according to the embodiment of the present invention, the speech corresponding to the recorded sentence T is recorded in units of exhalation paragraphs, the start and end times of the speech are detected, and the length of the silent section Ss between exhalation paragraphs is calculated. The voice data and additional information corresponding to the recorded sentence T are then generated from the recording result of the speech, the detection result of the start and end times, and the calculation result of the silent section Ss, and are stored in sentence units. As a result, the start and end times of the speech can be detected accurately, and the speech can be associated with the additional information with high accuracy. This makes it possible to provide a voice utilization system that is superior in the quality of the recorded speech, the quality of the additional information, and the cost of building the database.

Preferred embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to these examples. It is evident that a person having ordinary knowledge in the technical field to which the present invention belongs could conceive of various changes or modifications within the scope of the technical idea described in the claims, and it is understood that these also naturally fall within the technical scope of the present invention.

For example, the above description deals with the case where the recorded sentence is in Japanese, but the recorded sentence may be in a language other than Japanese.

DESCRIPTION OF SYMBOLS
10 Audio recording device
11 Recorded sentence acquisition unit
12 Recorded sentence dividing unit
13 Audio recording unit
14 Start/end detection unit
15 Silent section length calculation unit
16 Data generation unit
17 Data storage unit
18 Database
19 Control unit
T Recorded sentence
S Speaker
Sv Speech section
Ss Silent section

Claims

An audio recording device comprising:
a recorded sentence acquisition unit that acquires, in sentence units, a recorded sentence representing speech to be recorded;
a recorded sentence dividing unit that divides the recorded sentence into exhalation paragraph units by language analysis processing;
an audio recording unit that records speech corresponding to the recorded sentence in units of the exhalation paragraph;
a start/end detection unit that detects start and end times of the speech from the recording result of the speech;
a silent section length calculation unit that calculates a silent section length between the exhalation paragraphs included in the recorded sentence;
a data generation unit that generates voice data corresponding to the recorded sentence, and additional information used for retrieving the speech, from the recording result of the speech, the detection result of the start and end times, and the calculation result of the silent section; and
a data storage unit that stores the voice data corresponding to the recorded sentence and the additional information in sentence units.

The audio recording device according to claim 1, wherein the data generation unit generates, as the additional information, at least one of the start and end times of the speech, a morphological analysis result of the recorded sentence, and a silent section length to the subsequent exhalation paragraph.

The audio recording device according to claim 1, wherein the recorded sentence dividing unit divides the recorded sentence into units of the exhalation paragraph by a language analysis method capable of predicting boundary information and pause positions of the speech.

The audio recording device according to claim 1, wherein the audio recording unit records the speech uttered in accordance with guidance indicating the exhalation paragraph unit.

The audio recording device according to claim 1, wherein the silent section length calculation unit calculates the silent section length between the exhalation paragraphs based on a division result of the recorded sentence.

The audio recording device according to claim 1, wherein the silent section length calculation unit calculates the silent section length between the exhalation paragraphs based on a recording result of the speech corresponding to the recorded sentence.

An audio recording method comprising the steps of:
acquiring, in sentence units, a recorded sentence representing speech to be recorded;
dividing the recorded sentence into exhalation paragraph units by language analysis processing;
recording speech corresponding to the recorded sentence in units of the exhalation paragraph;
detecting start and end times of the speech from the recording result of the speech;
calculating a silent section length between the exhalation paragraphs included in the recorded sentence;
generating voice data corresponding to the recorded sentence, and additional information used for retrieving the speech, from the recording result of the speech, the detection result of the start and end times, and the calculation result of the silent section; and
storing the voice data corresponding to the recorded sentence and the additional information in sentence units.

A program for causing a computer to execute an audio recording method comprising the steps of:
acquiring, in sentence units, a recorded sentence representing speech to be recorded;
dividing the recorded sentence into exhalation paragraph units by language analysis processing;
recording speech corresponding to the recorded sentence in units of the exhalation paragraph;
detecting start and end times of the speech from the recording result of the speech;
calculating a silent section length between the exhalation paragraphs included in the recorded sentence;
generating voice data corresponding to the recorded sentence, and additional information used for retrieving the speech, from the recording result of the speech, the detection result of the start and end times, and the calculation result of the silent section; and
storing the voice data corresponding to the recorded sentence and the additional information in sentence units.