JP5326796B2

JP5326796B2 - Playback device

Info

Publication number: JP5326796B2
Application number: JP2009119513A
Authority: JP
Inventors: 武司村上
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2009-05-18
Filing date: 2009-05-18
Publication date: 2013-10-30
Anticipated expiration: 2029-05-18
Also published as: JP2010266778A

Abstract

PROBLEM TO BE SOLVED: To provide a playback device that creates a shortened audio signal with little sense of incongruity.

SOLUTION: Audio data is divided into blocks, one per segment, and each audio block is classified as voiced sound, unvoiced sound, or silence. The position where the sound begins is set as the start point. For voiced sound, a deemed end point is determined on the time axis using an estimated approximate line based on the peak values of several waveform cycles near the end of the block; for unvoiced sound, the end point is determined by level detection. The silent section contained in voiced sound, the unvoiced section that can be shortened without audible harm and the silent section contained in unvoiced sound, and the silent sections in which no audio data exists are then cut, and the remaining audio signals are joined together again.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a device that plays back, or records and plays back, speech composed of voiced sound, unvoiced sound, and silence. In particular, it relates to functions for listening to played-back speech in a shorter time or more clearly, and to reducing the capacity of the storage device that holds the audio signal. It concerns a playback device that keeps sound-quality degradation to a minimum without performing complicated computation.

Playback devices use various speech-rate conversion techniques so that played-back speech can be heard in a shorter time or more clearly. If the playback speed is simply increased along the time axis, the frequency of the reproduced speech rises in proportion to the speed-up ratio, and the speech becomes shrill and hard to follow. If, instead, portions whose level falls below a preset detection level are simply skipped on the time axis, nothing can be skipped in an audio signal that has no low-level regions. To solve this problem, Patent Document 1 discloses a time-axis expansion and compression technique based on digital signal processing: the pitch period of the input speech signal is extracted, and a weighting window function is applied to two pitch periods of speech data, according to that pitch period, to compress the time axis.

Patent Document 2 describes a configuration that distinguishes speech sections from silent sections, compresses the time axis of signals judged to be voiced sound or silence, and either leaves signals judged to be unvoiced sound uncompressed or compresses them at a lower ratio. Other configurations distinguish speech sections from silent sections, delete the silent sections, and either apply pitch-synchronous expansion and compression to the speech sections or switch the thinning-out process according to the playback speed.

JP-A-1-233835
JP-A-7-129198

The problem with such conventional playback devices is that, when speech sections and silent sections are expanded or compressed uniformly along the time axis, the intelligibility of the part the listener wants to hear deteriorates, and the generated audio data still contains time-scaled silence. In addition, time-axis compression of voiced sound requires complicated signal processing: to compress two periodic pitch periods into one, the leading period P1 is multiplied by a weight function W, the trailing period P2 is multiplied by the complementary weight function 1 - W, and the two products are added together. The processing load is therefore heavy and high-speed computation is required.
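To make the cost of the conventional approach concrete, the following is a minimal Python sketch of the two-pitch-to-one blend described above (out = W·P1 + (1 - W)·P2). The function name `compress_two_pitches` and the linearly falling shape of W are assumptions for illustration; the source does not specify the window shape.

```python
from typing import List


def compress_two_pitches(p1: List[float], p2: List[float]) -> List[float]:
    """Blend two equal-length pitch periods into one period.

    Each output sample is W * p1 + (1 - W) * p2, where W falls linearly
    from 1 to 0 across the period (assumed window shape).
    """
    if len(p1) != len(p2):
        raise ValueError("both pitch periods must have the same length")
    n = len(p1)
    out = []
    for i in range(n):
        # Weight for the leading period; 1 at the first sample, 0 at the last.
        w = 1.0 - i / (n - 1) if n > 1 else 1.0
        out.append(w * p1[i] + (1.0 - w) * p2[i])
    return out
```

Every output sample needs two multiplications and an addition, and the pitch period has to be extracted first, which is the per-sample load the invention tries to avoid.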

The object of the present invention is to provide a playback device that, without using complicated signal processing that requires high-speed computation, preserves intelligibility by retaining the minimum voiced and unvoiced sound needed for the speech to be understood, and thereby produces a shortened audio signal with little sense of incongruity.

The present invention is a device that plays back, or records and plays back, speech composed of voiced sound, unvoiced sound, and silence. It comprises: a start point level detection circuit that detects the presence of audio signal data by level detection and determines a start point; a voice property identification circuit that identifies whether a block of audio data is voiced sound, unvoiced sound, or silence; a voice blocking circuit that divides the audio data identified by the voice property identification circuit into blocks, one per segment, to determine audio chunks; an audio signal peak detection circuit that, when the audio data block is judged by the voice property identification circuit to be voiced sound, detects the peak values of a plurality of waveform cycles near the end of the audio waveform; a deemed end point estimation circuit that estimates a deemed end point from an estimated approximate line formed by the envelope of those peak values; an end point level detection circuit that, when the audio data block is judged to be unvoiced sound, detects by level detection that the audio data has ended; a voice section cut circuit that cuts data from the original audio signal based on the start point and end point information determined by the deemed end point estimation circuit and the end point level detection circuit; and an audio joining circuit that joins the audio data remaining after the cut to regenerate the signal.

For voiced sound, the audio data in the period from the end point determined by the deemed end point estimation circuit to the start point of the next chunk determined by the start point level detection circuit is cut. For unvoiced sound, the audio data in the period from the end point determined by the end point level detection circuit to the start point of the next chunk determined by the start point level detection circuit is cut. Silent sections are cut in their entirety, and the remaining data is joined together again. This removes the gaps between words, the silent portions that occur within a single word, and the unvoiced portions of unvoiced sound that can be shortened without audible harm, so that a shortened audio signal with little sense of incongruity can be produced.

The present invention further comprises a storage circuit that stores the audio output from the audio signal combining circuit. Because audio data has been cut, the storage capacity required is reduced.

In the playback device of the present invention, for voiced sound, a deemed end point is estimated from an estimated approximate line formed by the envelope of the peak values of the audio signal near the end of the audio block, and the audio data from that deemed end point to the start point of the next audio block is cut. For unvoiced sound, the audio data from the end point determined by level detection to the start point of the next audio block is cut. Silent sections are cut in their entirety, and the remaining data is joined together again to generate the audio signal. The resulting signal omits the gaps between words, the silent portions within a single word, and the unvoiced portions that can be shortened without audible harm, so the device has the advantage of producing a shortened audio signal with little sense of incongruity.

Storing the generated audio signal in the storage circuit in its shortened form also has the advantageous effect of reducing the storage capacity required.

FIG. 1 is a block diagram of the signal processing portion of the playback device according to Embodiment 1.
FIG. 2 shows an audio signal waveform that has been classified by sound type and divided into sections in the playback device according to Embodiment 1.
FIG. 3 shows the audio output waveform after signal joining in the playback device according to Embodiment 1.
FIG. 4 is an approximate-line diagram used to estimate the deemed end point in the playback device according to Embodiment 1.

The best mode for carrying out the playback device of the present invention is described in detail below with reference to FIGS. 1 to 4.

(Embodiment 1)
FIG. 1 is a block diagram of the signal processing portion of the playback device according to the first embodiment of the present invention. FIG. 2 is a waveform diagram of an audio signal that has been classified by sound type and divided into sections in the first embodiment. FIG. 3 shows the audio output waveform after signal joining in the first embodiment, and FIG. 4 is an approximate-line diagram used to estimate the deemed end point in the first embodiment.
In FIG. 1, 1 is the audio input, 2 the start point level detection circuit, 3 the voice property identification circuit, 4 the voice blocking circuit, 5 the voiced-sound signal path, 6 the unvoiced-sound signal path, 7 the silence signal path, 8 the audio signal peak detection circuit, 9 the deemed end point estimation circuit, 10 the end point level detection circuit, 11 the voice section cut circuit, 12 the audio signal joining circuit, and 13 the audio output.

The operation of the signal processing portion of the first embodiment, configured as described above, is explained below with additional reference to FIGS. 2, 3, and 4.

FIG. 1 shows the signal processing portion that characterizes the present invention, extracted from the playback device as a whole. FIG. 2 shows an example audio signal waveform on the time axis, classified by sound type and divided into sections. An audio signal consisting of an analog signal enters at audio input 1. The start point level detection circuit 2 monitors the level of this signal and, when it detects a level at or above a set threshold, judges that an audio signal has arrived. At that moment it marks a start point on the time axis. These start points correspond to points A, D, and G in the waveform of FIG. 2. The threshold set here determines the level that is treated as silence; if it is set too high, the onset of the speech may be clipped, so choosing an appropriate detection level is an important factor. At this stage the beginning of the audio signal, that is, the start point, is fixed, but it is not yet known whether the signal that follows is voiced or unvoiced. The audio signal marked with a start point by the start point level detection circuit 2 is passed to the voice property identification circuit 3.

The voice property identification circuit 3 identifies whether the audio signal is voiced sound, unvoiced sound, or silence. Judging each individual waveform cycle of a signal made up of many frequency components would make the processing cumbersome and would demand high signal processing capability. Because the present invention is premised on processing the gaps between words and the silent portions that occur within a single word, the voice blocking circuit 4 divides the audio data into blocks, one per segment, to determine audio chunks. There are various ways to block audio data; a simple one is to split on the presence or absence of signal level. Blocking with higher precision, using the frequency components, the level, and the envelope shape of the audio signal, helps prevent clipped onsets.

The voice property identification circuit 3 then examines the frequency components and the level of each blocked chunk and identifies whether that block of audio data is voiced sound, unvoiced sound, or silence.

As the classification by the voice blocking circuit in the upper part of FIG. 2 illustrates, the signal is first classified as silence up to point A, voiced sound from point A to D, unvoiced sound from point D to G, and voiced sound from point G to K. Silence is recognized by checking whether an audio signal is present. Distinguishing voiced from unvoiced sound requires a judgment based on the frequency and level of the signal: in general, a continuous run of high-frequency, low-level waveforms is unvoiced sound, whereas a signal made up of low-frequency waveforms with level fluctuation is voiced sound. A great many technical documents and patents describe methods for distinguishing unvoiced from voiced sound; since that is not the main purpose of the present invention, the details are omitted here.
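The start point marking and block classification described above can be sketched in Python as follows. This is an illustrative sketch only: the names `find_start_points` and `classify_block`, the `hold` parameter (how many consecutive quiet samples count as silence), and the use of zero-crossing rate as the voiced/unvoiced criterion are all assumptions, since the source deliberately leaves the identification method open.

```python
from typing import List, Sequence


def find_start_points(samples: Sequence[float], threshold: float, hold: int) -> List[int]:
    """Return indices where the signal reaches `threshold` after silence.

    Silence is assumed to mean at least `hold` consecutive samples whose
    absolute value is below `threshold` (the waveform crosses zero inside
    a sound, so a single quiet sample must not end it).
    """
    starts = []
    quiet_run = hold  # begin in the silent state
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if quiet_run >= hold:
                starts.append(i)  # first loud sample after silence
            quiet_run = 0
        else:
            quiet_run += 1
    return starts


def classify_block(block: Sequence[float], silence_level: float, zcr_limit: float) -> str:
    """Label a block as 'silence', 'unvoiced' or 'voiced'.

    Assumed heuristic: no sample reaching `silence_level` means silence;
    otherwise a high zero-crossing rate (high frequency) means unvoiced
    and a low one means voiced.
    """
    peak = max((abs(s) for s in block), default=0.0)
    if peak < silence_level:
        return "silence"
    crossings = sum(1 for a, b in zip(block, block[1:]) if (a < 0) != (b < 0))
    zcr = crossings / max(len(block) - 1, 1)
    return "unvoiced" if zcr > zcr_limit else "voiced"
```

As the text notes for the threshold, raising `threshold` or `silence_level` too far risks clipping the onset of a sound, so both would need tuning on real recordings.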

The audio data classified by the voice property identification circuit 3 into voiced sound, unvoiced sound, and silence then follows a different process for each type. A voiced block is subdivided into three parts by the audio signal peak detection circuit 8, which detects the peak values of a plurality of waveform cycles near the end of the audio waveform, and the deemed end point estimation circuit 9, which estimates a deemed end point from an estimated approximate line formed by the envelope of those peak values. The first part is the section from point A to B, or from point G to H, which the blocking circuit identified as voiced sound but which has the character of unvoiced sound. The second is the section from point B to C, or from point H to J, which is classified as fully voiced. The third is the silent portion from point C to D, or from point J to K, delimited by the deemed end points marked at C and J. How the deemed end point is determined is the gist of the present invention and is described in detail later. For an unvoiced block, the end point level detection circuit 10 detects by level detection that the audio data has ended and marks the end point F. The unvoiced block is thereby subdivided into the section from point D to E, an unvoiced portion that can be judged unnecessary; the section from point E to F, which is classified as fully unvoiced; and the section from point F to G, which can be regarded as silence.

Voiced and unvoiced blocks are subdivided by different methods because of the characteristics of the signals. Unvoiced sound consists of a continuous low-level, high-frequency waveform whose tail converges on silence as its level falls off gradually. Where it ends is ambiguous, so the end point F has to be determined by level detection against a threshold. Because the signal converges gently at a low level, the data can be cut on the time axis immediately after the sound dies away and joined to the head of the next audio block, and the result still sounds relatively natural.
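For the unvoiced case, threshold-based end point detection can be sketched as a backward scan for the last sample that still reaches the threshold. The function name `find_end_point_by_level` and the convention of returning the index just past that sample are assumptions made for this illustration.

```python
from typing import Sequence


def find_end_point_by_level(block: Sequence[float], threshold: float) -> int:
    """Return the index just past the last sample with |sample| >= threshold.

    Everything from the returned index to the end of the block is treated
    as silence. Returns 0 when no sample reaches the threshold.
    """
    for i in range(len(block) - 1, -1, -1):
        if abs(block[i]) >= threshold:
            return i + 1
    return 0
```

With a gently decaying tail such as `[0.3, -0.25, 0.2, -0.1, 0.05, -0.02, 0.0]` and a threshold of 0.08, the last qualifying sample is at index 3, so the function marks the end point at index 4.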

The voice section cut circuit 11 cuts audio signal data based on the start point and end point information determined by the deemed end point estimation circuit 9 and the end point level detection circuit 10. After the detailed classification, the blocks consist of voiced sound, unvoiced portions contained in voiced sound, silent portions contained in voiced sound, unvoiced sound, unvoiced-like portions contained in unvoiced sound, and silence that contains no sound at all. The audio data cut are: the sections from point C to D and from point J to K, which are silent portions contained in voiced sound; the section from point D to E, an unnecessary unvoiced portion of the unvoiced sound whose removal has no major effect on sound quality; and the section from point F to G, the silent portion contained in the unvoiced sound. The audio signal joining circuit 12 receives the cut audio signal data and joins it by directly connecting the zero-amplitude position chosen as an end point to the zero-amplitude position chosen as a start point. FIG. 3 shows an example of the audio output obtained by cutting and joining the waveform of FIG. 2.

In this first embodiment, the unvoiced portions contained in voiced sound and the brief silent portions between voiced sounds are not cut, but they could also be detected and cut to shorten the audio signal further. In that case the sections from point M to B and from point N to H in FIG. 2 would be cut.

The method for determining the deemed end point of voiced sound, mentioned above, is now described with reference to FIG. 4. Unvoiced sound converges through a gradual decay at a low level, whereas voiced sound decays quite differently; the decay of most typical voiced sounds has the following characteristic. Let α be the angle at which the extension of the estimated approximate line through the peak values at the onset of decay meets the time axis, and let β be the angle at which the extension of the approximate line through the peak values further into the decay, nearer the end, meets the time axis. Then β > α. In other words, no audio signal remains beyond the point where the extension of the first-stage estimated approximate line, drawn where the decay begins and preceding the second-stage line, intersects the time axis. Taking this intersection as the deemed end point, and cutting the audio signal from there onward, reproduces a joined sound with no sense of incongruity. Why this point works well is not clearly understood, but the effect was confirmed by processing a large number of audio signals and listening to them repeatedly. Inspection of the waveforms of many audio signals also showed that using 3 to 5 peak values to form the estimated approximate line tends to give a good end point.

By nature, the frequency components and peak values of an audio signal do not vary linearly but fluctuate. If an approximate line is drawn by strictly tracing the waveform peaks, the deemed end point being sought may therefore not be found. The peak values used to calculate the deemed end point thus call for a judgment method that allows some tolerance.

The description of this first embodiment used a graphically constructed estimated approximate line, but as a concrete way of constructing the approximation, the line can be generated by first-order linear interpolation so that the end point is found by digital processing.
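One way to picture the digital version of this estimate is the Python sketch below: pick out the positive waveform peaks, fit a straight line through the few peaks at the onset of decay, and take the point where that line reaches zero amplitude as the deemed end point. The helper names `positive_peaks`, `fit_line`, and `deemed_end_point` are hypothetical, and the least-squares fit is a substitute chosen here to tolerate fluctuating peak values; the source only says that a first-order linear approximation is used.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (time index, peak amplitude)


def positive_peaks(samples: Sequence[float]) -> List[Point]:
    """Local maxima of the positive half-waves as (index, value) pairs."""
    return [
        (float(i), float(samples[i]))
        for i in range(1, len(samples) - 1)
        if samples[i] > 0 and samples[i - 1] < samples[i] >= samples[i + 1]
    ]


def fit_line(points: Sequence[Point]) -> Tuple[float, float]:
    """Least-squares line v = slope * t + intercept (assumed fitting method)."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    stt = sum((t - mean_t) ** 2 for t, _ in points)
    stv = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = stv / stt
    return slope, mean_v - slope * mean_t


def deemed_end_point(peaks: Sequence[Point]) -> Optional[float]:
    """Time at which the fitted peak line meets the time axis.

    Returns None if fewer than two peaks are given or the peaks are not
    decaying, in which case no end point can be extrapolated.
    """
    if len(peaks) < 2:
        return None
    slope, intercept = fit_line(peaks)
    if slope >= 0:
        return None
    return -intercept / slope
```

Following the observation in the text, a caller would typically pass the 3 to 5 peaks at the start of the decay, for example `deemed_end_point(positive_peaks(block)[-5:])` if the block has already been trimmed to the decaying tail.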

For voiced sound, the playback device of the present invention cuts the audio data in the period from the end point determined by the deemed end point estimation circuit to the start point of the next chunk determined by the start point level detection circuit. For unvoiced sound, it cuts the audio data in the period from the end point determined by the end point level detection circuit to the start point of the next chunk determined by the start point level detection circuit. Silent sections are cut in their entirety, and the remaining data is joined together again. This removes the gaps between words, the silent portions within a single word, and the unvoiced portions that can be shortened without audible harm, producing a shortened audio signal with little sense of incongruity. The device can be used for fast listening as a form of speech-rate conversion, and, in equipment that records the audio data, for reducing the storage capacity required.
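The cut-and-join step summarized above amounts to dropping a list of index ranges and concatenating what is left. The sketch below assumes the cut boundaries have already been placed at zero-amplitude samples, as the embodiment describes; the function name `cut_and_join` and the half-open `(begin, end)` range convention are illustrative choices, not taken from the source.

```python
from typing import Iterable, List, Sequence, Tuple


def cut_and_join(samples: Sequence[float], cuts: Iterable[Tuple[int, int]]) -> List[float]:
    """Remove every half-open range [begin, end) and join the remainder.

    Each range runs from an end point to the start point of the next
    chunk. Ranges may be given in any order; overlapping ranges are
    handled by never moving the read position backwards.
    """
    out: List[float] = []
    pos = 0
    for begin, end in sorted(cuts):
        out.extend(samples[pos:begin])  # keep audio before this cut
        pos = max(pos, end)             # skip the cut section
    out.extend(samples[pos:])           # keep audio after the last cut
    return out
```

The ratio `len(cut_and_join(samples, cuts)) / len(samples)` then gives a rough measure of both the playback time saved and the storage saved when the shortened signal is recorded.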

1 Audio input
2 Start point level detection circuit
3 Voice property identification circuit
4 Voice blocking circuit
5 Voiced-sound signal path
6 Unvoiced-sound signal path
7 Silence signal path
8 Audio signal peak detection circuit
9 Deemed end point estimation circuit
10 End point level detection circuit
11 Voice section cut circuit
12 Audio signal joining circuit
13 Audio output

Claims

A playback device that plays back, or records and plays back, speech composed of voiced sound, unvoiced sound, and silence, comprising:
a start point level detection circuit that detects the presence of audio signal data by level detection and determines a start point;
a voice property identification circuit that identifies whether a block of audio data is voiced sound, unvoiced sound, or silence;
a voice blocking circuit that divides the audio data identified by the voice property identification circuit into blocks, one per segment, to determine audio chunks;
an audio signal peak detection circuit that, when the audio data block judged by the voice property identification circuit is voiced sound, detects peak values of a plurality of waveforms near the end of the audio waveform;
a deemed end point estimation circuit that estimates a deemed end point from an estimated approximate line formed by an envelope of the plurality of peak values;
an end point level detection circuit that, when the audio data block judged by the voice property identification circuit is unvoiced sound, detects by level detection that the audio data has ended;
a voice section cut circuit that cuts audio signal data based on information on the start point determined by the start point level detection circuit and on the end point determined by the deemed end point estimation circuit or the end point level detection circuit; and
an audio joining circuit that joins the audio data remaining after cutting to regenerate the signal,
wherein, in the case of voiced sound, following the audio data in the period from the start point determined by the start point level detection circuit to the end point determined by the deemed end point estimation circuit, the audio data in the period from that end point to the start point of the next chunk determined by the start point level detection circuit is cut; in the case of unvoiced sound, following the audio data in the period from the start point determined by the start point level detection circuit to the end point determined by the end point level detection circuit, the audio data in the period from that end point to the start point of the next chunk determined by the start point level detection circuit is cut; silent sections are cut in their entirety; and the remaining audio data is joined together again.

The playback device according to claim 1, further comprising a storage circuit that stores the audio output from the audio signal combining circuit, wherein cutting the audio data reduces the storage capacity required to store it.

The playback device according to claim 1, wherein the deemed end point estimation circuit obtains the deemed end point using an approximate line estimated by first-order linear interpolation.