WO2003107326A1 - Speech recognizing method and device thereof - Google Patents
Speech recognizing method and device thereof
- Publication number
- WO2003107326A1 (application PCT/JP2002/005847)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- section
- free
- free section
- point
- mountain
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
Definitions
- the present invention relates to a speech recognition method and apparatus for improving recognition performance under noise.
- Speech recognition is performed by comparing the power of an input signal with a preset threshold, detecting a section that is equal to or greater than the threshold as a speech section, and performing pattern matching with a standard pattern prepared in advance.
- FIG. 5 is a block diagram showing an example of the configuration of the speech recognition apparatus described in Japanese Patent Application Laid-Open No. 63-300295. In this example, a case where word recognition of a specific speaker is performed will be described.
- T is the total number of frames of the input signal 2.
- the characteristic vector X (t) is, for example, an LPC cepstrum obtained by LPC analysis.
- The number of zero crossings Z(t) is used to detect voiced sound sections. Because a voiced sound concentrates its power in the low-frequency components, its number of zero crossings is small. In this example, as described later, a section in which the voice power is equal to or greater than a predetermined value and the number of zero crossings Z(t) is small is regarded as a voiced sound.
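As a rough illustration of the zero-crossing count described above, the sketch below counts sign changes between consecutive samples in one frame. The function name `zero_crossings` and the convention of treating a zero-valued sample as positive are assumptions made for this example, not details taken from the cited publication.

```python
import math


def zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame.

    Assumption: a sample equal to 0 is treated as positive.
    """
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


# A low-frequency tone changes sign less often than a high-frequency tone,
# which is why a small count is taken as a hint of voiced sound.
low = [math.sin(2 * math.pi * 1 * n / 64) for n in range(64)]
high = [math.sin(2 * math.pi * 8 * n / 64) for n in range(64)]
print(zero_crossings(low), zero_crossings(high))
```

Low-frequency noise, such as the noise inside a car, also produces a small count, which is the weakness the publication goes on to discuss.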
- P (t) the power of the input signal
- Z (t) the time series 5 of the zero-crossing frequency
- a frame whose voice power exceeds a predetermined threshold is detected as the beginning PB of the voice section, and a frame whose voice power falls below the threshold is detected as the end PE of the voice section.
- a frame in which the number of zero crossings Z (t) is less than a predetermined threshold is detected as the start end ZB of the voiced sound section, and a frame in which the number is equal to or more than the threshold is detected as the end ZE of the voiced sound section.
- PB is the first frame of the start free section
- ZB is the last frame of the start free section.
- The section from the end ZE of the voiced sound section to the end PE of the voice section is determined as the end free section Efree = {ZE, PE}.
- ZE is the first frame of the last free section
- PE is the last frame of the last free section.
- Pattern matching, for example by DP matching, is performed between the standard pattern 10 stored in the standard pattern memory 9 and every combination of start and end points in the start-free section and the end-free section, and the combination with the smallest distance value is regarded as the recognition result.
- the feature vector X (t) is, for example, the LPC cepstrum obtained by LPC (Linear Predictive Coding) analysis.
- The end-point free sections are limited on the assumption that the correct start and end of the voice section lie between the detected voice section and the voiced sound section.
- However, there are various types of unsteady noise. In noise whose power is concentrated in the low region of the spectrum, such as noise inside an automobile, the number of zero crossings is small, so there is a high risk that the noise is judged to be a voiced sound section. If noise is judged to be a voiced sound, pattern matching is performed including the noise section, which may cause erroneous recognition.
- The present invention has been made to solve the above-mentioned problem. It does not require the determination of voiced sound sections, for which an accurate determination is difficult, and it efficiently limits the end-point free sections to improve speech recognition accuracy.
- An object of the present invention is to provide a voice recognition device.
- Disclosure of the invention
- A voice recognition method according to the present invention includes: an analysis step of acoustically analyzing an input voice and outputting power with respect to the input signal; an end-point free section determining step of detecting each section in which the power continuously exceeds a predetermined threshold as a mountain section, defining the mountain section with the maximum power as the maximum mountain section, assuming that a start free section exists before the point at which the maximum mountain section falls below the threshold and that an end free section exists after the point at which the maximum mountain section exceeds the threshold, and outputting combinations of a start free section and an end free section; and a collation step of performing pattern matching between a standard pattern and the pattern specified by the start free section and the end free section of each combination.
- The collation step may perform pattern matching between the standard pattern and each pattern specified by every combination of the start free sections and end free sections output by the end-point free section determining step.
- In the speech recognition method, the end-point free section determination step sets, among the detected mountain sections, the mountain section with the largest power accumulation as the maximum mountain section.
- The analysis step outputs the power for each detection point, and the end point free section determination means sets, as the maximum peak section, the detected mountain section for which the sum of the powers of a predetermined number of the highest-power detection points is the largest.
- The width of the range in which the start free section is assumed to exist and the width of the range in which the end free section is assumed to exist are different for each mountain section.
- The analysis step outputs the power for each frame, and the end-point free section determining means regards a frame whose power falls below the threshold as the point falling below the threshold, and a frame whose power exceeds the threshold as the point exceeding the threshold.
- the speech recognition device is a speech recognition device comprising: an analysis unit that performs acoustic analysis of an input voice and outputs power with respect to the input signal;
- the peak section with the maximum power is defined as the maximum section, and it is assumed that there is a start-free section before the point at which the maximum section falls below the threshold.
- The matching unit may perform pattern matching between the standard pattern and each pattern specified by every combination of the start free sections and end free sections output by the end-point free section determining means.
- The end-point free section determining means sets, among the detected peak sections, the peak section where the power accumulation is maximum as the maximum peak section.
- the analysis means outputs power for each detection point, and the end point free section determination means determines the detection point of the detected mountain section.
- the peak section in which the sum of the powers of a predetermined number of higher-order detection points in each power is the maximum is set as the maximum peak section.
- the endpoint free section determining means assumes that the start free section exists near a point where a mountain section before the maximum mountain section exceeds the threshold.
- the configuration is such that it is assumed that the terminal free section exists near a point where a mountain section after the maximum mountain section falls below the threshold.
- In the end point free section determining means, the width of the range where the start free section is assumed to exist and the width of the range where the end free section is assumed to exist may be different for each mountain section.
- The analysis unit outputs the power for each frame, and the end point free section determination unit regards a frame whose power falls below the threshold as the point falling below the threshold, and a frame whose power exceeds the threshold as the point exceeding the threshold.
- FIG. 1 is a configuration diagram of a speech recognition device according to Embodiments 1 and 2 of the present invention
- FIG. 2 is an explanatory diagram of a method of determining a start free section and an end free section according to the first embodiment of the present invention.
- FIG. 3 is an explanatory diagram of a method of determining a start-free section and an end-free section according to the second embodiment of the present invention.
- FIG. 4 is an explanatory diagram of a method for determining a start free section and an end free section according to the second embodiment of the present invention.
- FIG. 5 is a configuration diagram of a conventional speech recognition apparatus
- FIG. 6 is an explanatory diagram of a method of determining a start-free section and an end-free section according to a conventional technique.
- FIG. 1 is a block diagram illustrating a configuration of a speech recognition device according to a first embodiment of the present invention.
- 1 is a signal input terminal for inputting a signal
- 2 is an input signal input from the signal input terminal
- 3 is an analysis means for performing an acoustic analysis on the input signal
- 4 is a value calculated by the analysis means 3.
- 5 is the time series of the input signal characteristic vector calculated by the analysis means 3
- 6 is the end free section determining means, which determines the end free sections based on the input signal time series 5.
- 8 is the end free section information output by the end free section determining means 6
- 9 is the standard pattern memory, which stores the standard patterns used in the matching process for speech recognition.
- 10 is a standard pattern used in a matching process for voice recognition
- 11 is a matching means for performing pattern matching with a standard pattern of each word.
- T is the total number of frames of the input signal 2.
- the feature vector X (t) is, for example, an LPC cepstrum obtained by LPC (linear prediction) analysis.
- The power P(t) is obtained, for example, by taking the logarithm of the sum of squares of the digital values of the input signal in the frame.
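The power computation described above can be sketched as follows. This is a minimal illustration, assuming non-overlapping frames of a fixed length, the natural logarithm, and a small floor to avoid taking the logarithm of zero; the names `frame_log_power`, `frame_len` and `floor` are hypothetical and do not come from the publication.

```python
import math


def frame_log_power(samples, frame_len, floor=1e-10):
    """Return P(t), t = 0..T-1: log of the sum of squared samples per frame.

    Assumptions: frames do not overlap, a trailing partial frame is dropped,
    and `floor` guards against log(0) on silent frames.
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame)
        powers.append(math.log(max(energy, floor)))
    return powers


# Two frames of two samples each: energies 2 and 8.
P = frame_log_power([1, 1, 2, 2], frame_len=2)
```

A real front end would typically use overlapping, windowed frames; that detail is left out here because the text does not specify it.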
- B (i) is the beginning of the mountain section, that is, the frame whose power exceeds the threshold.
- E (i) is the end of the mountain section, that is, a frame whose power is less than the threshold.
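A mountain section, as defined by B(i) and E(i) above, is a run of consecutive frames whose power stays above the threshold. The sketch below finds all such runs. It assumes 0-based frame indices and, for a run still above the threshold at the end of the signal, uses the total frame count as E; both choices, and the name `find_mountain_sections`, are assumptions for illustration.

```python
def find_mountain_sections(power, threshold):
    """Return [(B, E), ...] for each run of frames with power > threshold.

    B is the first frame above the threshold; E is the first following
    frame that is no longer above it. If the signal ends inside a run,
    E is set to len(power) (an assumption, not from the publication).
    """
    sections = []
    begin = None
    for t, p in enumerate(power):
        if p > threshold and begin is None:
            begin = t                      # power rose above the threshold
        elif p <= threshold and begin is not None:
            sections.append((begin, t))    # power dropped back
            begin = None
    if begin is not None:
        sections.append((begin, len(power)))
    return sections


sections = find_mountain_sections([0, 5, 6, 0, 0, 7, 0], threshold=1)
```

With this convention `power[B:E]` is exactly the set of frames above the threshold in that section.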
- equation (1) calculates the power intensity for each mountain section as the sum of all the powers in the section. Whether to use (1) or (2) should be selected according to the type of environmental noise assumed when using the speech recognition device and the speech to be recognized.
- $I = \operatorname{argmax}_{i}\, PR(i)$ (3)
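The selection of the maximum mountain section can be sketched as below. The first measure follows the description of equation (1), the sum of all powers in the section. The second measure is only a reading of the summary elsewhere in this text (the sum of a predetermined number of the largest powers in the section) and is an assumption about the form of equation (2). Sections are taken as `(B, E)` pairs with `E` exclusive, and all function names are hypothetical.

```python
def intensity_sum(power, section):
    """Power intensity as the sum of all frame powers in the section."""
    b, e = section
    return sum(power[b:e])


def intensity_top_k(power, section, k):
    """Power intensity as the sum of the k largest frame powers (assumed form)."""
    b, e = section
    return sum(sorted(power[b:e], reverse=True)[:k])


def max_mountain_index(power, sections, intensity=intensity_sum):
    """Index I of the section with the highest power intensity, as in (3)."""
    return max(range(len(sections)), key=lambda i: intensity(power, sections[i]))


# A short spike (frame 1) and a longer, weaker mountain (frames 4..7).
power = [0, 9, 0, 0, 3, 3, 3, 3, 0]
sections = [(1, 2), (4, 8)]
by_sum = max_mountain_index(power, sections)
by_top1 = max_mountain_index(power, sections, lambda p, s: intensity_top_k(p, s, 1))
```

In this toy input the summed measure prefers the long section and the top-1 measure prefers the spike, which illustrates why the text says the choice should depend on the expected noise.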
- The end point free section determination means 6 determines the start free section from the starting point B(1) of the first peak section and the starting point B(I) of the peak section with the highest power intensity, according to equations (4) and (5).
- bfL is the first frame of the start free section
- bfR is the last frame of the start free section
- The start margins bm1 and bm2 are predetermined constants of 0 or more.
- the end free section determination means 6 determines the end of the last mountain section.
- efL is the first frame of the end free section
- efR is the last frame of the end free section.
- The terminal margins em1 and em2 are predetermined constants of 0 or more.
- Figure 2 shows the start free section BF and the end free section EF determined by the above processing.
- $efL = E(I) - em1$ (6)
- $efR = E(N) + em2$ (7)
- The section $EF = \{efL, efR\}$ is output as end point free section information 8.
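The first embodiment's free sections can be sketched as follows. Equations (6) and (7) are taken from the text, reading the margin in (7) as em2. The forms used for the start free section, bfL = B(1) - bm1 and bfR = B(I) + bm2, are assumed by symmetry with (6) and (7) and with the description of B(1) and B(I). Clamping to the valid frame range and the name `embodiment1_free_sections` are additions for this example.

```python
def embodiment1_free_sections(sections, max_idx, bm1, bm2, em1, em2, num_frames):
    """Return ((bfL, bfR), (efL, efR)) for a list of (B, E) mountain sections.

    sections : [(B(1), E(1)), ..., (B(N), E(N))] in time order
    max_idx  : 0-based index of the maximum mountain section (I)
    Margins are constants >= 0. Results are clamped to [0, num_frames - 1],
    which is an assumption not stated in the text.
    """
    last = num_frames - 1
    b_first = sections[0][0]          # B(1)
    b_max, e_max = sections[max_idx]  # B(I), E(I)
    e_last = sections[-1][1]          # E(N)
    bfL = max(0, b_first - bm1)       # assumed form of equation (4)
    bfR = min(last, b_max + bm2)      # assumed form of equation (5)
    efL = max(0, e_max - em1)         # equation (6)
    efR = min(last, e_last + em2)     # equation (7)
    return (bfL, bfR), (efL, efR)


BF, EF = embodiment1_free_sections(
    [(10, 20), (30, 50), (60, 70)], max_idx=1,
    bm1=2, bm2=3, em1=4, em2=5, num_frames=100)
```

Every start candidate lies between the first rise in power and the rise of the strongest section, and every end candidate lies between the fall of the strongest section and the last fall in power.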
- The matching means 11 performs pattern matching between every combination of start and end points in the start-free section and the end-free section and the standard pattern 10, REF(i), of each word stored in the standard pattern memory 9. This process is performed sequentially for all of the standard patterns REF(i) (i = 1, 2, 3, ..., K).
- The standard pattern with the smallest difference is output as the recognition result 12.
- DP matching is used as a pattern matching method.
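The exhaustive search over start and end points can be sketched with a basic dynamic-programming (DTW-style) distance. This is a generic textbook formulation, not the specific DP matching used in the publication: the Euclidean local cost, the three-way step pattern, the normalisation by the summed lengths, and the names `dtw_distance` and `recognise` are all assumptions for illustration.

```python
import math


def dtw_distance(a, b):
    """Length-normalised DTW distance between two feature-vector sequences."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0
    for x in a:
        cur = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = math.dist(x, y)  # Euclidean local distance
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return prev[len(b)] / (len(a) + len(b))


def recognise(features, start_free, end_free, references):
    """Try every (start, end) pair in the free sections against every reference.

    start_free and end_free are inclusive (first, last) frame pairs.
    Returns (distance, word, start, end) for the smallest distance found.
    """
    best = (float("inf"), None, None, None)
    for s in range(start_free[0], start_free[1] + 1):
        for e in range(max(s, end_free[0]), end_free[1] + 1):
            segment = features[s:e + 1]
            for word, ref in references.items():
                d = dtw_distance(segment, ref)
                if d < best[0]:
                    best = (d, word, s, e)
    return best


features = [(0,), (0,), (1,), (2,), (3,), (0,), (0,)]
references = {"up": [(1,), (2,), (3,)], "down": [(3,), (2,), (1,)]}
result = recognise(features, start_free=(1, 3), end_free=(3, 5), references=references)
```

The cost grows with the product of the two free-section widths, which is one practical reason to keep the free sections narrow.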
- The present embodiment is based on the assumption that, under noise, "the peak with the highest intensity is not background noise, but speech or a part thereof to be recognized."
- Although DP matching is used as the pattern matching method here, the same effect can be obtained with other pattern matching methods such as HMM (Hidden Markov Model).
- The same effect can also be obtained for continuous speech recognition as well as word recognition.
- Example 2
- the start point B (i), (i l, 2,3, ...
- bfL (i) is the first frame of the i-th start-end free section and is obtained by equation (8).
- bfR(i) is the last frame of the i-th start-free section and is obtained by equation (9).
- the starting margin bmL (i) and bmR (i) are predetermined constants of 0 or more.
- efL(i) is the first frame of the i-th end free section
- efR (i) is the last frame of the i-th end free section.
- The end margins emL(i) and emR(i) are predetermined constants of 0 or more.
- Figure 3 shows the start-free sections and the end-free sections determined by the above processing.
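The second embodiment's per-section free sections can be sketched as below. The forms bfL(i) = B(i) - bmL(i), bfR(i) = B(i) + bmR(i), efL(i) = E(i) - emL(i) and efR(i) = E(i) + emR(i) are assumptions, since only the symbol definitions are given here. The sketch also assumes that start free sections are built for the mountain sections up to and including the maximum one and end free sections for the maximum one onward. The function name is hypothetical, and clamping to the signal length is omitted.

```python
def embodiment2_free_sections(sections, max_idx, bm_left, bm_right, em_left, em_right):
    """Return (start_sections, end_sections) as lists of (first, last) frames.

    sections : [(B, E), ...] mountain sections in time order
    max_idx  : 0-based index of the maximum mountain section
    bm_left, bm_right : start margins, one per start free section
    em_left, em_right : end margins, one per end free section
    """
    starts = [
        (b - bm_left[i], b + bm_right[i])
        for i, (b, _) in enumerate(sections[:max_idx + 1])
    ]
    ends = [
        (e - em_left[j], e + em_right[j])
        for j, (_, e) in enumerate(sections[max_idx:])
    ]
    return starts, ends


# Only the outermost margins are non-zero, so the free sections inside the
# utterance collapse to single frames.
starts, ends = embodiment2_free_sections(
    [(10, 20), (30, 50), (60, 70)], max_idx=1,
    bm_left=[5, 0], bm_right=[0, 0],
    em_left=[0, 0], em_right=[0, 5])
```

Compared with one wide start section and one wide end section, this leaves far fewer candidate end points that fall inside the utterance.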
- The start margins bmL(i) and bmR(i) may be set to different values for each start free section BF(i), or may be set to a common value.
- The margin bfL1 on the left side of the first start free section BF1 extends the free section toward the outside of the voice, so even if its value is increased, the danger of partial matching does not increase significantly.
- The right side of the first start free section BF1 and the second and subsequent start free sections may lie within the voice section, so setting large values for the start margins bmL(i) and bmR(i) increases the possibility of partial matching.
- If the start margins on the left and right sides of the second and subsequent start free sections and the right-side start margin of the first start free section are set smaller than the left-side start margin of the first start free section, or set to 0, the free sections inside the voice become smaller or vanish, which has the effect of suppressing partial matching.
- end margins emL (i) and emR (i) may be set to different values for each end free section EF (i), or may be set to a common value.
- The margin efR(N-I+1) on the right side of the last end free section EF(N-I+1) extends the free section toward the outside of the voice, so even if its value is increased, the danger of partial matching does not increase much.
- The left side of the last end free section and the other end free sections may lie within the voice section, so setting large values for the end margins emL(i) and emR(i) increases the possibility of partial matching.
- The end margins on the left and right sides of the other end free sections and on the left side of the last end free section are therefore set smaller than the right-side margin of the last end free section, or set to 0.
- The figure shows the case where the start margins on both the left and right sides of the second and subsequent start free sections and the right-side start margin of the first start free section are 0, and where the end margins on both the left and right sides of the end free sections other than the last and the end margin on the left side of the last end free section are 0.
- DP matching is used as the pattern matching.
- In the second embodiment, in addition to the restriction on the end-point free sections described in the first embodiment, the start free sections are limited to the sections before and after the rise of a power peak, and the end free sections are limited to the sections before and after the end of a power peak. Thus, erroneous recognition due to partial matching can be further reduced.
- Since the present invention is configured as described above, it is not necessary to judge voiced sound sections, for which an accurate judgment is difficult, and the end-point free sections are efficiently limited so that they fall inside the speech as little as possible. This makes it possible to reduce erroneous recognition due to partial matching.
- Since the present invention is configured as described above, it is possible to select, from all the combinations of start and end points, the combination with the smallest difference from the standard pattern.
- Since the present invention is configured as described above, speech recognition can be performed efficiently in an environment where spike-like noise can occur, in which the duration of the power peak is short but the instantaneous signal power becomes large.
- Since the present invention is configured as described above, speech recognition can be performed efficiently in an environment where noise can occur in which the duration of the power peak is long but the maximum value of the power is not large.
- Since the present invention is configured as described above, it is possible to reduce detection errors at the start and end.
- Since the present invention is configured as described above, it is possible to reduce the risk of partial matching while reducing detection errors of the start and end points.
- Since the present invention is configured as described above, it can be applied to a speech recognition device that performs acoustic analysis on a frame basis.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
- Mobile Radio Communication Systems (AREA)
Description
Claims
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2004514058A JPWO2003107326A1 (ja) | 2002-06-12 | 2002-06-12 | Speech recognizing method and device thereof |
US10/511,158 US20050165604A1 (en) | 2002-06-12 | 2002-06-12 | Speech recognizing method and device thereof |
CNA028291026A CN1628337A (zh) | 2002-06-12 | 2002-06-12 | Speech recognizing method and device thereof |
PCT/JP2002/005847 WO2003107326A1 (ja) | 2002-06-12 | 2002-06-12 | Speech recognizing method and device thereof |
EP02738666A EP1513135A1 (en) | 2002-06-12 | 2002-06-12 | Speech recognizing method and device thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2002/005847 WO2003107326A1 (ja) | 2002-06-12 | 2002-06-12 | Speech recognizing method and device thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2003107326A1 (ja) | 2003-12-24 |
Family
ID=29727345
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2002/005847 WO2003107326A1 (ja) | Speech recognizing method and device thereof | 2002-06-12 | 2002-06-12 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20050165604A1 (ja) |
EP (1) | EP1513135A1 (ja) |
JP (1) | JPWO2003107326A1 (ja) |
CN (1) | CN1628337A (ja) |
WO (1) | WO2003107326A1 (ja) |
-
2002
- 2002-06-12 WO PCT/JP2002/005847 patent/WO2003107326A1/ja not_active Application Discontinuation
- 2002-06-12 US US10/511,158 patent/US20050165604A1/en not_active Abandoned
- 2002-06-12 CN CNA028291026A patent/CN1628337A/zh active Pending
- 2002-06-12 EP EP02738666A patent/EP1513135A1/en not_active Withdrawn
- 2002-06-12 JP JP2004514058A patent/JPWO2003107326A1/ja not_active Withdrawn
Also Published As
Publication number | Publication date |
---|---|
JPWO2003107326A1 (ja) | 2005-10-20 |
US20050165604A1 (en) | 2005-07-28 |
EP1513135A1 (en) | 2005-03-09 |
CN1628337A (zh) | 2005-06-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 2004514058 Country of ref document: JP |
|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): CN JP US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 10511158 Country of ref document: US |
|
WWE | Wipo information: entry into national phase |
Ref document number: 20028291026 Country of ref document: CN |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2002738666 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 2002738666 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 2002738666 Country of ref document: EP |