KR100322693B1 - Voice recognition method using linear prediction analysis synthesis - Google Patents
Voice recognition method using linear prediction analysis synthesis
- Publication number
- KR100322693B1 (applications KR1019950018111A and KR19950018111A)
- Authority
- KR
- South Korea
- Prior art keywords
- recognition
- speech
- voice
- speaker
- input
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G10L15/075—Adaptation to the speaker supervised, i.e. under machine guidance
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/08—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
- G10L19/12—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
The present invention relates to an improved speech recognition method that maps features extracted from an input speech signal onto prepared standard-speaker characteristics and performs recognition on the speech synthesized from the mapped features.
As shown in FIG. 1, the conventional speech recognition method performs recognition directly on features extracted from the input speech signal. This approach requires creating or training reference patterns against which recognition is performed, but because the speaker's manner of utterance at recognition time never exactly matches that of training, the recognition rate degrades.
In addition, for speaker-independent recognition, sample features are often extracted from many speakers and grouped appropriately, or converted into features close to the reference pattern through speaker-adaptation techniques. Such speaker adaptation is a complex algorithm that maps in two dimensions, and its performance is limited, so obtaining the expected benefit is very difficult.
The reason is that the algorithm must make a set of features preserving one speaker's characteristics also cover features with different characteristics; to obtain the maximum effect, the reference features must be built by extracting features from the speech data of a great many speakers and compiling statistics over them.
Moreover, the presence of so many speaker characteristics means that the distribution of each feature grows ever wider, blurring the boundaries that discriminate one sound from another.
Consequently, improving the recognition rate is very difficult, and in the speaker-independent case the recognition rate falls below that of the speaker-dependent case.
It is an object of the present invention to provide a speech recognition method that removes utterance-to-utterance variation within a speaker and characteristic deviation between speakers, thereby raising the speaker-independent recognition rate above even the speaker-dependent rate.
A speech recognition method according to the present invention that achieves the above object comprises:
removing speaker-dependent characteristics from the input speech signal and mapping them onto the characteristics of a standard speaker;
generating a synthesized sound based on the mapped characteristics and extracting recognition features from the generated synthesized sound; and
recognizing the speech by comparing the recognition features extracted from the synthesized sound with the recognition features of standard patterns. The present invention is described in detail below with reference to the accompanying drawings.
FIG. 2 shows the speech recognition method according to the present invention, which is carried out through the following steps.
1) Extract the features of the input speech by linear prediction analysis.
Here, linear prediction analysis is a technique that decomposes a speech signal into a vocal tract model and the excitation signal that drives it; representative methods include LPC (Linear Predictive Coding), PARCOR (Partial Correlation), and LSP (Line Spectrum Pair). The present invention uses LSP analysis, whose features are distributed sequentially across the frequency band so that feature mapping is comparatively easy; the features extracted by the analysis are the pitch, the amplitude, and tenth-order vocal tract coefficients.
Here, the pitch represents the intonation of the speech, the amplitude represents its loudness, and the vocal tract coefficients characterize the speaker's vocal tract, carrying the frequency information at which resonances occur.
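The analysis step can be sketched as follows. This is a minimal, illustrative take on per-frame feature extraction: autocorrelation-based pitch estimation, RMS amplitude, and 10th-order linear prediction coefficients via the Levinson-Durbin recursion. The patent uses LSP coefficients; the LPC-to-LSP conversion is omitted here, and the sample rate and pitch search range are assumptions rather than values from the patent.

```python
import math

def lpc_features(frame, order=10, sr=8000):
    """Return (pitch_hz, amplitude, lpc_coefficients) for one frame."""
    n = len(frame)
    # RMS energy of the frame serves as the amplitude feature.
    amplitude = math.sqrt(sum(s * s for s in frame) / n)
    # Autocorrelation up to the LPC order.
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    # Levinson-Durbin recursion for the prediction coefficients.
    a = [1.0] + [0.0] * order
    err = max(r[0], 1e-12)
    for m in range(1, order + 1):
        acc = sum(a[j] * r[m - j] for j in range(m))
        k = -acc / err
        a = [a[j] + k * a[m - j] for j in range(m + 1)] + a[m + 1:]
        err = max(err * (1.0 - k * k), 1e-12)  # guard near-deterministic input
    # Crude pitch estimate: autocorrelation peak in a 50-400 Hz lag range.
    lo, hi = sr // 400, min(sr // 50, n - 1)
    ac = [sum(frame[i] * frame[i + k] for i in range(n - k))
          for k in range(hi + 1)]
    lag = max(range(lo, hi + 1), key=lambda k: ac[k])
    return sr / lag, amplitude, a[1:]
```

For a 400-sample frame of a 100 Hz sine at 8 kHz, this yields a pitch estimate near 100 Hz and an amplitude near 0.707.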
2) Map the extracted feature values onto the feature values of a standard speaker.
The extracted pitch value is mapped onto the pitch value of a designated representative speaker.
Here, when the speakers differ greatly, the distortion introduced by the mapping grows large. The speakers are therefore classified into three groups (a representative male, a representative female, and a representative child), and three basic feature patterns are prepared so that no mapping is performed across the male/female/child boundaries. The basic feature patterns are constructed as follows.
The three representative pitch values, for men, women, and children, are predetermined by analyzing the speech signal of one person representing each group and taking its average.
When the average pitch of the input signal, taken over the whole utterance, is replaced by the representative pitch value closest to it, the result is a flat speech signal from which intonation has been removed.
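The pitch-mapping rule just described can be sketched as follows: the utterance's mean pitch selects the nearest representative group, and every frame's pitch is replaced by that group's value. The representative pitch values below are illustrative assumptions; the patent predetermines them from one analyzed speaker per group.

```python
# Representative pitch values (Hz) for the three groups.  These numbers
# are illustrative assumptions, not values taken from the patent.
REP_PITCH = {"male": 120.0, "female": 220.0, "child": 300.0}

def map_pitch(frame_pitches):
    """Replace every frame pitch with the representative value whose
    group is closest to the utterance's mean pitch, flattening the
    intonation as described in step 2."""
    mean_pitch = sum(frame_pitches) / len(frame_pitches)
    group = min(REP_PITCH, key=lambda g: abs(REP_PITCH[g] - mean_pitch))
    return group, [REP_PITCH[group]] * len(frame_pitches)
```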
Likewise, the extracted vocal tract coefficients are mapped onto those of a representative speaker.
The representative vocal tract coefficients also come in three sets, predetermined by analyzing the speech signal of one person each representing men, women, and children and recording the distribution range of each.
Among the male, female, and child sets, the reference vocal tract coefficients whose per-order distribution ranges are closest to those of the input signal are selected, and the input coefficients are then adjusted so that the distribution range at each order matches the corresponding per-order range of the selected representative features.
The coefficients are adjusted by shifting, expanding, or contracting their range. This feature mapping amounts to removing from the input signal the intonation component due to the manner of utterance and converting its vocal tract characteristics into the designated representative ones.
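The shift/expand/contract adjustment can be read as a linear range mapping applied per coefficient order, sketched below. Treating it as a single affine transform of the observed range onto the representative range is an assumption about how the patent's adjustment is realized.

```python
def adjust_coeff_track(values, src_range, dst_range):
    """Map one order's coefficient track from its observed range onto a
    representative range via a shift plus scaling (expansion or
    contraction).  src_range and dst_range are (min, max) pairs."""
    s_lo, s_hi = src_range
    d_lo, d_hi = dst_range
    scale = (d_hi - d_lo) / (s_hi - s_lo) if s_hi > s_lo else 1.0
    return [d_lo + (v - s_lo) * scale for v in values]
```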
3) When feature mapping is complete, synthesis is performed using the mapped features.
As shown in FIG. 3, the synthesis is LSP synthesis: the mapped vocal tract coefficients are fed in as synthesis filter coefficients, and the excitation signal is generated at the amplitude obtained during analysis, as a pulse train at the designated pitch in voiced segments and random noise in unvoiced segments.
Ordinary analysis-synthesis methods use the residual signal extracted during analysis as the synthesis excitation, in order to reproduce the original sound perfectly.
The present invention selects pulses and noise as the excitation instead of the residual signal because a fixed generation process keeps sound production consistent, and because the speaker characteristics remaining in the residual signal are thereby excluded.
In addition, the frame boundaries of the synthesized sound are made to coincide with the frame boundaries used when extracting features from the input signal, which improves the accuracy of feature extraction.
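The pulse/noise excitation source of step 3 can be sketched as follows. The synthesis filter driven by the mapped coefficients is omitted, and the unit-pulse shape and uniform noise distribution are simplifying assumptions.

```python
import random

def make_excitation(n, voiced, pitch_period, amplitude, rng=None):
    """Generate one frame of excitation: a pulse train at the mapped
    pitch period for voiced frames, white noise for unvoiced frames,
    both scaled to the amplitude measured during analysis."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch deterministic
    if voiced:
        return [amplitude if i % pitch_period == 0 else 0.0 for i in range(n)]
    return [amplitude * rng.uniform(-1.0, 1.0) for i in range(n)]
```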
4) Re-extract recognition features from the synthesized speech.
As a result of the above steps, recognition features are obtained from which the components due to within-speaker and between-speaker variation have been removed. A speaker adaptation module such as is found in conventional speech recognition algorithms therefore becomes unnecessary.
Furthermore, as a result of inserting the synthesis step, the frame-period analysis intervals become exactly synchronized with the segment structure of the speech signal, improving the accuracy of feature extraction.
5) Perform speech recognition using a standard recognition algorithm.
Once the recognition features have been re-extracted, recognition is performed with an algorithm such as DTW (Dynamic Time Warping), HMM (Hidden Markov Modeling), or NN (Neural Network).
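Of the listed algorithms, DTW is the one used in the patent's experiment. A textbook dynamic-time-warping distance over scalar feature sequences is sketched below; in practice each element would be a vector of the re-extracted per-frame features.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic DTW distance between two feature sequences a and b,
    filled in by dynamic programming over the alignment grid."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of the three allowed predecessor cells.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because DTW warps the time axis, a sequence and a time-stretched copy of it score a distance of zero.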
In the present invention, a comparative experiment was conducted using the DTW method.
In an experiment on a 100-word vocabulary, the speaker-dependent DTW method attained a 97% recognition rate and the speaker-independent DTW method 91%, whereas the DTW method according to the present invention attained a 98% recognition rate in the speaker-independent case.
As described above, the speech recognition method according to the present invention does not extract recognition features directly from the input speech signal; instead, it maps the input onto the voice of a prepared standard speaker, extracts new recognition features from the resulting synthesized sound, and performs recognition on those. Removing the utterance-to-utterance variation of a speaker and the characteristic deviation between speakers in this way improves the recognition rate.
In addition, repeated training over many speakers becomes unnecessary, and the method adapts to a speaker in real time.
Furthermore, exploiting the one-dimensional nature of the linear prediction features both simplifies feature mapping and improves its accuracy.
FIG. 1 illustrates a conventional speech recognition process.
FIG. 2 illustrates the speech recognition process according to the present invention.
FIG. 3 is a block diagram showing one embodiment of an apparatus for performing the speech synthesis process shown in FIG. 2.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1019950018111A KR100322693B1 (en) | 1995-06-29 | 1995-06-29 | Voice recognition method using linear prediction analysis synthesis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1019950018111A KR100322693B1 (en) | 1995-06-29 | 1995-06-29 | Voice recognition method using linear prediction analysis synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
KR970002856A KR970002856A (en) | 1997-01-28 |
KR100322693B1 true KR100322693B1 (en) | 2002-05-13 |
Family
ID=37460732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1019950018111A KR100322693B1 (en) | 1995-06-29 | 1995-06-29 | Voice recognition method using linear prediction analysis synthesis |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR100322693B1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100667522B1 (en) * | 1998-12-18 | 2007-05-17 | 주식회사 현대오토넷 | Speech Recognition Method of Mobile Communication Terminal Using LPC Coefficient |
KR100762588B1 (en) * | 2001-06-26 | 2007-10-01 | 엘지전자 주식회사 | voice recognition method for joing the speaker adaptation and the rejection of error input |
KR100427243B1 (en) * | 2002-06-10 | 2004-04-14 | 휴먼씽크(주) | Method and apparatus for analysing a pitch, method and system for discriminating a corporal punishment, and computer readable medium storing a program thereof |
CN112102833B (en) * | 2020-09-22 | 2023-12-12 | 阿波罗智联(北京)科技有限公司 | Speech recognition method, device, equipment and storage medium |
- 1995-06-29: application KR1019950018111A filed in KR; granted as KR100322693B1 (not active, IP right cessation)
Also Published As
Publication number | Publication date |
---|---|
KR970002856A (en) | 1997-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Vergin et al. | Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition | |
US6535852B2 (en) | Training of text-to-speech systems | |
Childers et al. | Voice conversion | |
US6829581B2 (en) | Method for prosody generation by unit selection from an imitation speech database | |
CN104272382A (en) | Method and system for template-based personalized singing synthesis | |
CN116665669A (en) | Voice interaction method and system based on artificial intelligence | |
KR20180078252A (en) | Method of forming excitation signal of parametric speech synthesis system based on gesture pulse model | |
Yadav et al. | Prosodic mapping using neural networks for emotion conversion in Hindi language | |
KR100322693B1 (en) | Voice recognition method using linear prediction analysis synthesis | |
CN116364096B (en) | Electroencephalogram signal voice decoding method based on generation countermeasure network | |
JP3281266B2 (en) | Speech synthesis method and apparatus | |
Saitou et al. | Analysis of acoustic features affecting" singing-ness" and its application to singing-voice synthesis from speaking-voice | |
Asakawa et al. | Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics | |
Acero | Source-filter models for time-scale pitch-scale modification of speech | |
Aso et al. | Speakbysinging: Converting singing voices to speaking voices while retaining voice timbre | |
Blomberg | Adaptation to a speaker's voice in a speech recognition system based on synthetic phoneme references | |
JP3727885B2 (en) | Speech segment generation method, apparatus and program, and speech synthesis method and apparatus | |
Yathigiri et al. | Voice transformation using pitch and spectral mapping | |
US7130799B1 (en) | Speech synthesis method | |
JP7079455B1 (en) | Acoustic model learning devices, methods and programs, as well as speech synthesizers, methods and programs | |
Wang et al. | Tone recognition of continuous Mandarin speech based on hidden Markov model | |
JPH0293500A (en) | Pronunciation evaluating method | |
JP3921416B2 (en) | Speech synthesizer and speech clarification method | |
Banerjee et al. | Voice intonation transformation using segmental linear mapping of pitch contours | |
Orphanidou et al. | Voice morphing using the generative topographic mapping |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant | ||
FPAY | Annual fee payment | Payment date: 2010-12-30; year of fee payment: 10 |
LAPS | Lapse due to unpaid annual fee |