JPS6310437B2

Info

Publication number: JPS6310437B2
Application number: JP56035710A
Authority: JP
Inventors: Yoshiteru Mifune; Hidekazu Tsuboka; Satoru Kabasawa
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1981-03-11
Filing date: 1981-03-11
Publication date: 1988-03-07
Also published as: JPS57148799A

Description

[Detailed Description of the Invention]

The present invention relates to a method for detecting the beginning of a word in speech. Specifically, a speech section is first cut out based on the power values of an input signal pattern sequence, each frame of that section is then classified into a phoneme class based on its pattern, and the word beginning is finally detected from the arrangement of the resulting phoneme sequence. This removes noise that occurs at the beginning of an utterance (ambient noise and noise caused by the lips, teeth, tongue, and saliva) while preserving word-initial voiceless consonants. The object is to improve the accuracy of word-beginning detection and thereby the recognition rate of a speech recognition device.

Assume that the input signal pattern sequence is expressed as a sequence of feature vectors, as in equation (1).

$$X_1, X_2, \ldots, X_N \tag{1}$$

Each $X_i$ ($i = 1, \ldots, N$) is an $m$-dimensional vector, written $X_i = (x_{i1}, \ldots, x_{im})$. The feature vector can be regarded, for example, as the time-sampled outputs $x_1(t), \ldots, x_j(t), \ldots, x_m(t)$ of an $m$-channel bandpass filter bank.

The signal interval represented by one feature vector (one time-sampling interval) is called a frame.

The power value sequence of the signal pattern sequence is the sequence of per-frame power values shown in equation (2):

$$PW_1, PW_2, \ldots, PW_N \tag{2}$$

When the signal pattern sequence of equation (1) is obtained from a filter bank, the power value of vector $X_i$ is defined as

$$PW_i = \left( \sum_{j=1}^{m} x_{ij}^{2} \right)^{1/2} \quad \text{or} \quad PW_i = \sum_{j=1}^{m} |x_{ij}| \tag{3}$$

where $X_i = (x_{i1}, \ldots, x_{im})$.
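The two forms of the power value in equation (3) can be illustrated with a short Python sketch. The function names `power_l2`, `power_l1`, and `power_series` are hypothetical and do not appear in the patent; a frame is assumed to be a plain list of filter-bank channel outputs.

```python
import math


def power_l2(frame):
    """First form of equation (3): square root of the sum of squared channel outputs."""
    return math.sqrt(sum(x * x for x in frame))


def power_l1(frame):
    """Second form of equation (3): sum of the magnitudes of the channel outputs."""
    return sum(abs(x) for x in frame)


def power_series(frames, power=power_l1):
    """Power value sequence PW_1..PW_N of equation (2), one value per frame.

    Defaulting to the magnitude-sum form is an illustrative choice only.
    """
    return [power(frame) for frame in frames]
```

Either form yields one non-negative number per frame, so the rest of the method can be applied regardless of which form is chosen.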

Most conventional methods for detecting the word beginning in a speech signal pattern sequence rely only on the power values of the input signal pattern sequence given by equation (3). As shown in Fig. 1, the word beginning is taken to be the first point at which frames whose power value $PW$ is at or above the minimum speech power value $PV_{MIN}$ continue for at least a fixed frame length L1. In the sequence after this word beginning, the word end is taken to be the first point at which frames whose power value $PW$ is at or below $PV_{MIN}$ continue for at least a fixed frame length L2. The speech section V of the input signal pattern sequence is detected in this way.
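The conventional single-threshold rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular device: the names `first_run` and `detect_section` are hypothetical, the section is returned as a half-open index pair, and the word end is placed at the first frame of the quiet run, a detail the text does not specify.

```python
def first_run(flags, min_len, start=0):
    """Return the first index, at or after `start`, of a run of at least
    `min_len` consecutive True values, or None if no such run exists."""
    run = 0
    for i in range(start, len(flags)):
        run = run + 1 if flags[i] else 0
        if run >= min_len:
            return i - min_len + 1
    return None


def detect_section(pw, pv_min, l1, l2):
    """Single-threshold speech section detection on a power sequence `pw`.

    Returns (begin, end) with `end` exclusive, or None if no word beginning
    is found. Treating the end of the data as the word end when no quiet
    run follows is an assumption made for this sketch.
    """
    begin = first_run([p >= pv_min for p in pw], l1)
    if begin is None:
        return None
    end = first_run([p <= pv_min for p in pw], l2, start=begin + l1)
    if end is None:
        end = len(pw)
    return begin, end
```

With `pw = [0, 0, 5, 0, 0, 5, 5, 5, 5, 0, 5, 0, 0, 0, 5]`, `pv_min = 1`, `l1 = 3`, and `l2 = 2`, the lone loud frame at index 2 is too short to count as a word beginning and the single quiet frame at index 9 is too short to count as a word end, so the sketch returns `(5, 11)`.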

Some methods that aim to preserve word-initial voiceless consonants cut out the speech section using two power levels, $Q_1$ and $Q_2$, as shown in Fig. 2. The first point at which frames whose power value $PW$ is at or above the threshold $Q_2$ continue for at least a fixed frame length L1′ is taken as the word-beginning candidate, and the point immediately before that candidate at which $PW$ crosses the threshold $Q_1$ is taken as the word beginning.
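A minimal Python sketch of the two-level rule follows. It assumes that $Q_1$ is the lower of the two levels, which the text implies but does not state, and it interprets "the point at which PW crosses Q1" as the earliest frame of the unbroken stretch at or above $Q_1$ that leads into the candidate. The function name `detect_beginning_two_level` is hypothetical.

```python
def detect_beginning_two_level(pw, q1, q2, l1p):
    """Two-level word-beginning detection on a power sequence `pw`.

    q1, q2: lower and upper power thresholds (q1 < q2 is assumed).
    l1p:    minimum run length L1' at or above q2 for a candidate.
    Returns the index of the word-beginning frame, or None.
    """
    # Find the candidate: first run of at least l1p frames at or above q2.
    run = 0
    cand = None
    for i, p in enumerate(pw):
        run = run + 1 if p >= q2 else 0
        if run >= l1p:
            cand = i - l1p + 1
            break
    if cand is None:
        return None

    # Walk backward from the candidate while the power stays at or above q1.
    begin = cand
    while begin > 0 and pw[begin - 1] >= q1:
        begin -= 1
    return begin
```

For `pw = [0, 0, 2, 3, 9, 9, 9, 0]` with `q1 = 1`, `q2 = 5`, and `l1p = 3`, the candidate is frame 4 and the backward walk moves the word beginning to frame 2, so the weaker frames ahead of the loud run are kept.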

Word-beginning detection methods of this kind, which rely only on the power level of the signal pattern sequence and the length of the interval over which that level is sustained, may detect noise occurring at the onset of an utterance as the word beginning. Such noise includes ambient noise and vibration, as well as sounds made in preparing to speak, such as the teeth, lips, or tongue striking one another and noise from saliva. If the power threshold level is raised or the required interval length is lengthened to remove this noise, word-initial voiceless consonants can no longer be preserved, and the accuracy of word-beginning detection becomes extremely low.

For this reason, conventional speech recognition devices could not cut out speech sections accurately, and their recognition rates were low.

As described above, when the word beginning of a speech pattern sequence is detected from the power values of the signal pattern sequence alone, it is difficult to distinguish noise at the onset of an utterance from voiceless consonants, and detection accuracy falls. When the signal pattern sequence of equation (1) is obtained from a filter bank, however, the frequency information contained in the pattern of each frame can also be used.

The present invention focuses on this point and is described below together with an embodiment.

In addition to the power value sequence of equation (2), the method uses the low-band bias value sequence of equation (4),

$$PL_1, PL_2, \ldots, PL_N \tag{4}$$

and the power bias value sequence of equation (5),

$$PD_1, PD_2, \ldots, PD_N \tag{5}$$

When the filter bank consists of $m$ channels whose center frequencies $w_{cj}$ satisfy

$$w_{c1} < w_{c2} < \cdots < w_{cj} < \cdots < w_{cm},$$

the low-band bias value of the feature vector $X_i$ is defined as

$$PL_i = \sum_{j=1}^{k} |x_{ij}|, \quad k < m/2 \tag{6}$$

and the power bias value of the feature vector $X_i$ is defined, using the power value $PW_i$ of equation (3), as

$$PD_i = \min \left\{ j \;\middle|\; \sum_{l=1}^{j} |x_{il}| > PW_i / 2 \right\} \tag{7}$$

That is, $PD_i$ is the lowest channel index at which the magnitude accumulated from the lowest channel upward first exceeds half the power value of the frame.
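The two frequency parameters can be computed per frame as in the following Python sketch. The function names `low_band_bias` and `power_bias` are hypothetical. The sketch assumes the magnitude-sum form of $PW_i$ from equation (3) and returns `None` for an all-zero frame, a case the definition of equation (7) does not cover.

```python
def low_band_bias(frame, k):
    """Equation (6): sum of magnitudes of the k lowest-frequency channels.

    `frame` lists channel outputs in order of increasing center frequency.
    """
    if not 0 < k < len(frame) / 2:
        raise ValueError("k must satisfy 0 < k < m/2")
    return sum(abs(x) for x in frame[:k])


def power_bias(frame):
    """Equation (7): smallest 1-based channel index j at which the cumulative
    magnitude exceeds half of the frame power.

    The frame power is taken as the sum of magnitudes (an assumption).
    Returns None when no index qualifies, i.e. for an all-zero frame.
    """
    pw = sum(abs(x) for x in frame)
    acc = 0.0
    for j, x in enumerate(frame, start=1):
        acc += abs(x)
        if acc > pw / 2:
            return j
    return None
```

For the 8-channel frame `[1, 1, 1, 1, 4, 0, 0, 0]`, `low_band_bias(frame, 3)` is 3 and `power_bias(frame)` is 5: the cumulative magnitude reaches exactly half the power at channel 4 and only exceeds it at channel 5. A frame dominated by low channels gives a small power bias value, and one dominated by high channels gives a large one.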

In other words, the word beginning of a speech pattern sequence is detected from three parameters: the power value of equation (3) and, as frequency information, the low-band bias value of equation (6) and the power bias value of equation (7).

Each frame of the speech pattern sequence can also be given a rough phoneme classification from these three parameters: the power value, the low-band bias value, and the power bias value. Here, rough phoneme classification means classifying a frame as a voiced sound, a voiceless consonant, or silence. Voiced sounds are the vowels (|a|, |i|, |u|, |e|, |o|), the voiced consonants (|m|, |n|, |b|, |g|, |d|, |r|, |Z|), the semivowels (|z|, |w|), and the moraic nasal (|x|, written ん). Voiceless consonants are (|c|, |s|, |h|, |p|, |t|, |k|) and the geminate (choked) sound (|Q|, written っ). Silence is the state in which no phoneme is being uttered.

Fig. 4 shows the correspondence between the rough phoneme classification of each frame of a speech pattern sequence and the power value $PW$, the power bias value $PD$, and the low-band bias value $PL$. The figure was obtained with a 20-channel filter bank having the center frequencies and bandwidths shown in Fig. 3, and the low-band bias value $PL$ is computed with $k = 3$ in equation (6) (the sum of the three lowest channels). Fig. 4(a) shows the rough phoneme classification for the case $PL \le 0.05 \times PW$, and Fig. 4(b) shows the case $PL > 0.05 \times PW$.

Word-beginning detection for a speech pattern sequence is therefore carried out in three stages. First, a speech section is cut out based on the power values of the signal pattern sequence. Next, each frame of the signal pattern sequence in that section is roughly classified into a phoneme class based on its power value $PW_i$, low-band bias value $PL_i$, and power bias value $PD_i$. Finally, the word beginning is detected from the arrangement of the resulting phoneme sequence. Detecting the word beginning in this way exploits both the characteristic ordering of phonemes in Japanese speech and the frequency and duration characteristics of noise at the onset of an utterance, and so achieves higher accuracy.

The characteristic of phoneme ordering in Japanese speech is that a syllable consists of a vowel, a consonant + vowel, or a consonant + semivowel + vowel; a consonant never stands alone. The characteristic of noise at the onset of an utterance is that, being impulsive, it occupies an isolated interval; under the rough phoneme classification it appears as a short, isolated voiceless-consonant interval (which may include some voiced frames). A word-initial voiceless consonant is therefore detected as the run of voiceless-consonant frames immediately preceding the first interval in the phoneme sequence in which voiced frames continue for at least a fixed length (the vowel). Noise at the onset of an utterance is detected as a voiceless-consonant interval (possibly including some voiced frames) of no more than a fixed length that is isolated from that continuous voiced interval.

The method for detecting the word beginning in the phoneme sequence of a speech section is explained with reference to Fig. 5. The figure shows arrangements of phoneme sequences: H denotes the starting frame of the cut-out speech section, ■ denotes a voiced frame, □/ denotes a voiceless-consonant frame, and □ denotes a silent frame.

First, the first interval after the starting frame H in which voiced frames continue for at least a fixed length L3 is detected, and its first frame ip is found (this detects the vowel, semivowel, or voiced consonant of a syllable). If there is no silent frame between frame H and frame ip, as in Fig. 5(a), frame H is taken as the word beginning WH (no isolated noise frames exist). If there is a silent frame between frame H and frame ip, the frame immediately after the silent frame nearest to frame ip is taken as the word-beginning candidate frame WH1 (this preserves the voiceless consonant immediately before the vowel or semivowel). If there is no isolated run of non-silent frames (voiced or voiceless-consonant frames) of length L4 or more between frame H and frame WH1, as in Fig. 5(b), frame WH1 is taken as the word beginning WH (this removes noise at the onset of the utterance). If such an isolated run of non-silent frames of length L4 or more does exist between frame H and frame WH1, as in Fig. 5(c), the first frame of the run nearest to frame WH1 is taken as the word beginning WH (this preserves word-initial voiceless and voiced consonants).
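The decision procedure of Fig. 5 can be sketched in Python as follows. This is an illustrative reading of the text, not the circuit of the embodiment. The function name `find_word_beginning` and the single-letter labels are hypothetical: `"V"` stands for a voiced frame, `"U"` for a voiceless-consonant frame, and `"S"` for a silent frame, with index 0 playing the role of the starting frame H.

```python
def find_word_beginning(labels, l3, l4):
    """Return the index of the word-beginning frame WH, or None.

    labels: per-frame rough phoneme labels, starting at frame H (index 0).
    l3:     minimum length of the voiced run that marks frame ip.
    l4:     minimum length of an isolated non-silent run that is kept as speech.
    """
    # Step 1: find ip, the first frame of the first voiced run of length >= l3.
    run = 0
    ip = None
    for i, lab in enumerate(labels):
        run = run + 1 if lab == "V" else 0
        if run >= l3:
            ip = i - l3 + 1
            break
    if ip is None:
        return None

    # Step 2: with no silent frame before ip, frame H is the word beginning.
    silent = [i for i in range(ip) if labels[i] == "S"]
    if not silent:
        return 0

    # Step 3: the candidate WH1 is the frame after the silent frame nearest ip.
    wh1 = silent[-1] + 1

    # Step 4: scan backward from WH1 for the nearest isolated non-silent run
    # of length >= l4; shorter runs are treated as onset noise and skipped.
    i = wh1 - 1
    while i >= 0:
        if labels[i] == "S":
            i -= 1
            continue
        end = i
        while i >= 0 and labels[i] != "S":
            i -= 1
        start = i + 1
        if end - start + 1 >= l4:
            return start
    return wh1
```

With `l3 = 3` and `l4 = 2`, the sequence `"UUVVVV"` gives 0 (the case of Fig. 5(a)), `"USSUUVVVV"` gives 3 because the single noise frame at index 0 is dropped while the two voiceless frames before the vowel are kept (Fig. 5(b)), and `"SUUUSSUVVVV"` gives 1 because the three-frame run is long enough to be kept as part of the word (Fig. 5(c)).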

Fig. 6 shows a specific configuration of a device that implements the word-beginning detection method of the present invention. In the figure, the input unit 1 consists of a filter bank 13 and a sampler 14; the parameter calculation unit 2 consists of a power value calculator 15, a low-band bias value calculator 16, and a power bias value calculator 17; the speech section extraction unit 3 consists of a power value discriminator 18 and a power value sequence counter 19; and the phoneme classification unit 4 consists of a rough phoneme classifier 20, a phoneme sequence counter 21, a phoneme register A 22, a phoneme register B 23, and an output gate 25. Reference numeral 12 denotes a microphone, 24 a word-beginning detection unit, and 26 an output terminal.

The operation is as follows. The speech signal input from the microphone 12 passes through the filter bank 13 and the sampler 14 and is supplied to the parameter calculation unit 2 as a signal parameter sequence. In the parameter calculation unit 2, the power value calculator 15 calculates the power value of the pattern sequence and supplies it to the speech section extraction unit 3, the low-band bias value calculator 16, and the power bias value calculator 17. While the speech section signal $e_1$ is being output from the speech section extraction unit 3, the low-band bias value calculator 16 and the power bias value calculator 17 calculate the low-band bias value and the power bias value from the pattern and the power value and output them to the phoneme classification unit 4.

In the speech section extraction unit 3, the power value discriminator 18 determines whether each power value is at or above a fixed threshold level, the power value sequence counter 19 counts the frames at or above that level, and the speech section signal $e_1$ is output when such frames continue for a fixed number of frames.

In the phoneme classification unit 4, while the speech section signal $e_1$ is being output from the speech section extraction unit 3, the rough phoneme classifier 20 classifies each frame as a voiced sound, a voiceless consonant, or silence from the power value, low-band bias value, and power bias value output by the parameter calculation unit 2, and outputs the result to the phoneme sequence counter 21 and the phoneme register A 22. When the phoneme sequence counter 21 first detects that voiced frames have continued for the fixed length L3 or more, the contents of the phoneme register A 22 are transferred in parallel to the phoneme register B 23. The word-beginning detection unit 24 detects the word beginning from the arrangement of the phoneme sequence in the phoneme register B 23 and outputs the contents of the phoneme register B 23, starting from the word beginning, through the output gate 25 as the output phoneme sequence. Because the phoneme register A 22 is updated successively, once the contents of the phoneme register B 23 have been output, the contents of the phoneme register A 22 are output through the output gate 25 to the output terminal 26 as the output phoneme sequence, starting from the frame that follows the contents of the phoneme register B 23.

As is clear from the above description, the present invention cuts out a speech section based on the power values of an input signal pattern sequence, roughly classifies each frame of the signal pattern sequence in that section into a phoneme class based on the power value together with the low-band bias value and power bias value obtained from the pattern, and then detects the word beginning from the arrangement of the phoneme sequence of the speech section. This removes noise at the onset of an utterance while preserving word-initial voiceless consonants, improves the accuracy of word-beginning detection, and thereby improves the recognition rate of a speech recognition device.

[Brief Description of the Drawings]

Figs. 1 and 2 are waveform diagrams each showing an operation for cutting out a speech section. Fig. 3 is a frequency characteristic diagram of the filter bank that produces the speech signal pattern sequence to which the word-beginning detection method of the present invention is applied. Figs. 4(a) and 4(b) are diagrams showing the correspondence between the rough phoneme classification of each frame of a signal pattern sequence and the power value, low-band bias value, and power bias value of the pattern. Figs. 5(a), 5(b), and 5(c) are diagrams each explaining the process of detecting the word beginning from the phoneme sequence of a speech section. Fig. 6 is a block diagram of a word-beginning detection device to which the present invention is applied.

1: input unit; 2: parameter calculation unit; 3: speech section extraction unit; 4: phoneme classification unit.

Claims

[Claims]

1. A method for detecting the beginning of a word in speech, characterized in that: a speech section is cut out by detecting a start interval, in which the power value of an input signal pattern sequence is for the first time at or above a power value set as a threshold continuously for a fixed length, and an end interval, in which the power value of the pattern sequence after that start interval is for the first time at or below a power value set as a threshold continuously for a fixed length; each frame of the speech section is classified, based on the pattern of that frame, as a voiced sound (vowel or voiced consonant), a voiceless consonant, or silence; and the word beginning of the speech section is detected such that, if there is no silent frame between the starting point of the speech section and the interval in which voiced frames are continuous, the starting point is taken as the word beginning; if there is a silent frame, the frame immediately after the silent frame nearest to the continuous voiced-frame interval is taken as a word-beginning candidate; if there is no continuous non-silent (voiced or voiceless-consonant) frame interval between the starting point and the word-beginning candidate, the word-beginning candidate is taken as the word beginning; and if there is such a continuous non-silent frame interval, the first frame of the continuous non-silent frame interval nearest to the word-beginning candidate is taken as the word beginning.