KR100912339B1

KR100912339B1 - Apparatus and Method of training a minority speech data

Info

Publication number: KR100912339B1
Application number: KR1020070045485A
Authority: KR
Inventors: 이우영; 박귀홍; 김영명; 정성택; 이영훈
Original assignee: 주식회사 케이티
Priority date: 2007-05-10
Filing date: 2007-05-10
Publication date: 2009-08-14
Also published as: KR20080099656A

Abstract

1. Technical field of the claimed invention

The present invention relates to a few-speaker voice data training apparatus and method using voice variation.

2. Technical problem to be solved by the invention

An object of the present invention is to provide a few-speaker voice data training apparatus and method that use voice variation, such as modulation and time stretching, so that a small amount of voice data uttered by a small number of speakers yields the same performance as training on a large amount of diverse voice data.

3. Summary of the solution

The present invention provides an apparatus for training voice data uttered by speakers, comprising: a voice recording unit for recording voice data uttered by a predetermined small number of speakers (hereinafter "reference voice data"); a voice variation unit for converting the reference voice data recorded by the voice recording unit into voice data of a different age and/or a different gender by performing at least one of pitch modulation in the frequency domain and stretching in the time domain; a voice data training unit for performing a voice data training process on the reference voice data converted by the voice variation unit; and a speech recognition model generation unit for generating a speech recognition model database (DB) based on the voice data trained by the voice data training unit.

4. Important uses of the invention

The present invention is used for speech recognition and the like.

Few speakers, voice data training, speech recognition model, voice variation, pitch modulation, tempo modulation

Description

Apparatus and Method of training a minority speech data using speech variation

FIG. 1 is a block diagram of an embodiment of a few-speaker voice data training apparatus using voice variation according to the present invention;

FIGS. 2A to 2E are graphs of an embodiment showing the results of applying voice variation to reference voice data according to the present invention; and

FIG. 3 is a flowchart of an embodiment of a few-speaker voice data training method using voice variation according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

11: voice recording unit 12: voice variation unit

12a: pitch shifter 12b: tempo shifter

13: voice data training unit 14: speech recognition model generation unit

The present invention relates to an apparatus and method for training voice data uttered by speakers, and more particularly to a few-speaker voice data training apparatus and method that use voice variation, such as modulation and time stretching, so that a small amount of voice data uttered by a small number of speakers yields the same performance as training on a large amount of diverse voice data.

Recently, advances in terminal technology (for example, intelligent robots, also called "URC (Ubiquitous Robotic Companion) robots", and mobile phones) have made it possible to provide users with various content and services through terminals connected to the Internet. In the following description, an "intelligent robot" is used as an example of such a terminal, but the present invention is not limited thereto.

For example, such intelligent robots provide a variety of content and services, such as news, weather, educational content, schedule management, home monitoring, and cleaning, through communication with the user, emotional responses to user commands, and action responses in specific situations.

Meanwhile, voice-based multimodal interaction has been attracting attention as a convenient means of communication, information delivery, and content provision between users and intelligent robots.

In voice-based multimodal interaction, speech recognition technology is particularly important. It is classified by where recognition is performed: when the intelligent robot itself recognizes the voice input by the user, it is called terminal-based speech recognition; when a speech recognition server recognizes the voice received from the intelligent robot, it is called server-based speech recognition.

In both terminal-based and server-based speech recognition, what matters most is improving recognition performance, and a major factor in that performance is which speech recognition model is used.

To build such a speech recognition model, voice data uttered by a large number of speakers, preferably a large and diverse collection covering different ages, genders, and so on, must be collected and then used for training.

However, in this conventional approach, recruiting a large number of speakers, collecting and recording the speech of each, and then training on it takes considerable time and cost.

In particular, in the field of intelligent robots, the target audience of the content and services differs from robot to robot. For example, a robot that provides English education content targets infants and children, while a care robot for elderly people living alone targets the elderly. Speech recognition between such infants or elderly users and the robot (or, of course, a speech recognition server connected to the robot via the Internet) therefore requires training on a large amount of diverse voice data covering different ages, genders, and so on.

However, collecting the voices of infants and the elderly raises legal issues and is difficult to control, and collecting voice data of high enough quality to guarantee speech recognition quality is, in practice, very difficult.

Accordingly, there is a strong need for a technique that obtains, from a small amount of voice data uttered by a small number of speakers, the same performance as training on a large amount of diverse voice data. In particular, there is a strong need for a technique that can generate a speech recognition model from a small amount of voice data, without voice data from infants, the elderly, and others whose voices are difficult to collect.

The present invention has been proposed to solve the above problems and to meet the above needs. Its object is to provide a few-speaker voice data training apparatus and method that use voice variation, such as modulation and time stretching, so that a small amount of voice data uttered by a small number of speakers yields the same performance as training on a large amount of diverse voice data.

Other objects and advantages of the present invention can be understood from the following description and will become more apparent from the embodiments of the present invention. It will also be readily appreciated that the objects and advantages of the present invention can be realized by the means indicated in the claims and combinations thereof.

To achieve the above object, the apparatus of the present invention is an apparatus for training voice data uttered by speakers, comprising: a voice recording unit for recording voice data uttered by a predetermined small number of speakers (hereinafter "reference voice data"); a voice variation unit for converting the reference voice data recorded by the voice recording unit into voice data of a different age and/or a different gender by performing at least one of pitch modulation in the frequency domain and stretching in the time domain; a voice data training unit for performing a voice data training process on the reference voice data converted by the voice variation unit; and a speech recognition model generation unit for generating a speech recognition model database (DB) based on the voice data trained by the voice data training unit.

The method of the present invention is a method for training voice data uttered by speakers, comprising: a reference voice data recording step of recording voice data uttered by a predetermined small number of speakers (hereinafter "reference voice data"); a voice data conversion step of converting the recorded reference voice data into voice data of a different age and/or a different gender by performing at least one of pitch modulation in the frequency domain and stretching in the time domain; a step of performing a voice data training process on the converted reference voice data; and a step of generating a speech recognition model database (DB) based on the trained voice data.

The present invention further provides a computer-readable recording medium storing a program that causes a speech recognition device equipped with a processor to realize: a function of recording voice data uttered by a predetermined small number of speakers (hereinafter "reference voice data"); a function of converting the recorded reference voice data into voice data of a different age and/or a different gender by performing at least one of pitch modulation in the frequency domain and stretching in the time domain; a function of performing a voice data training process on the converted reference voice data; and a function of generating a speech recognition model database (DB) based on the trained voice data.

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, so that those of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of related known technology are omitted where they would unnecessarily obscure the gist of the invention. Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an embodiment of a few-speaker voice data training apparatus using voice variation according to the present invention.

FIGS. 2A to 2E are graphs of an embodiment showing the results of applying voice variation to reference voice data according to the present invention.

As shown in FIG. 1, the few-speaker voice data training apparatus using voice variation according to the present invention includes a voice recording unit 11, a voice variation unit 12 having a pitch shifter 12a and a tempo shifter 12b, a voice data training unit 13, and a speech recognition model generation unit 14. Although not shown in the drawing, those skilled in the art will readily understand that the apparatus also includes a known signal processor for handling ordinary analog voice signals, digital voice data, and the like; a description of that component is omitted because it could obscure the gist of the invention.

In addition, the apparatus of the present invention can be applied to any device equipped with a speech recognition function, such as a terminal (for example, an intelligent robot, also called a "URC (Ubiquitous Robotic Companion) robot", or a mobile phone) or a speech recognition server, and can also be used to build a speech recognition model DB.

The present invention concerns training on voice data uttered by speakers. In particular, to obtain from a small amount of voice data uttered by a small number of speakers the same performance as training on a large amount of diverse voice data, it applies voice variation such as modulation and time stretching to the reference voice data, then runs a voice data training process, and finally generates a speech recognition model.

In the present invention, "reference voice data" means voice data uttered by a predetermined small number of speakers. These speakers are preferably about 10 men and about 10 women in each of their 20s, 30s, and 40s who speak the standard dialect. Naturally, recognition performance benefits if these speakers articulate more clearly than ordinary people.

As mentioned above, the conventional approach requires at least about 1,000 speakers for the voice data training used to generate a speech recognition model; for example, a statistically varied set of people of each age and gender is recruited and their voice data collected and trained on. In contrast, the present invention needs only a very small number of speakers for voice data training. The details of the present invention are described below.

The voice recording unit 11 records the reference voice data, that is, voice data uttered by a predetermined small number of speakers. For example, under the direction of an administrator, about 10 men and about 10 women in each of their 20s, 30s, and 40s who speak the standard dialect take turns uttering the words and sentences typically used to build a speech recognition model DB, and each utterance is recorded.

The voice variation unit 12 applies modulation, stretching, and the like to the reference voice data recorded by the voice recording unit 11 and thereby converts it into voice data of a different age and/or a different gender. Here, the modulation technique changes the pitch of the reference voice data in the frequency domain to alter its timbre, and the stretching technique copies and appends, or deletes, single-period waveform segments of the reference voice data in the time domain.
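The time-domain stretching technique (copying and appending, or deleting, single-period waveform segments) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `stretch_by_periods`, the fixed `period` given in samples, and the `factor` parameter are introduced here for illustration, and a real system would have to estimate the pitch period from the signal itself.

```python
def stretch_by_periods(samples, period, factor):
    """Lengthen (factor > 1) or shorten (factor < 1) a signal by
    repeating or dropping whole period-length segments.

    `period` is an assumed, fixed pitch period in samples (hypothetical;
    the source does not say how the period is determined).
    """
    if period <= 0 or factor <= 0:
        raise ValueError("period and factor must be positive")

    # Cut the signal into consecutive one-period segments.
    frames = [samples[i:i + period]
              for i in range(0, len(samples) - period + 1, period)]
    if not frames:
        # Signal is shorter than one period: nothing to copy or delete.
        return list(samples)

    n_out = max(1, round(len(frames) * factor))
    out = []
    for k in range(n_out):
        # Map each output segment back to a source segment. With
        # factor > 1 some source segments are used twice (copied);
        # with factor < 1 some are never used (deleted).
        src = min(int(k / factor), len(frames) - 1)
        out.extend(frames[src])
    return out


tone = [0, 1, 0, -1] * 6                    # 6 periods of 4 samples
slower = stretch_by_periods(tone, 4, 1.5)   # 9 periods, 36 samples
faster = stretch_by_periods(tone, 4, 0.5)   # 3 periods, 12 samples
```

Because whole periods are repeated or dropped, the waveform within each period is unchanged, which is why this kind of stretching alters speaking rate without altering pitch.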

In particular, the voice variation unit 12 has a pitch shifter 12a and a tempo shifter 12b. The pitch shifter 12a performs pitch modulation to make the reference voice data sound higher or lower, and the tempo shifter 12b performs tempo modulation to make the reference voice data sound slower or faster.

FIGS. 2A to 2E show the results of applying voice variation to reference voice data according to the present invention. The graph in FIG. 2A represents reference voice data uttered by a particular man in his 20s.

As shown in FIG. 2B, when the voice variation unit 12 raises the pitch of the reference voice data of FIG. 2A by 30%, the male voice is converted into a female voice.

As shown in FIG. 2C, when the voice variation unit 12 raises the pitch of the reference voice data of FIG. 2A by 60%, the male voice is converted into an infant voice.

In the same way, by adjusting the pitch up or down, the voice variation unit 12 can obtain voice data for each gender from a single piece of reference voice data.
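To make the 30% and 60% pitch increases concrete, the sketch below converts a percentage increase into a frequency ratio and into musical semitones. It assumes that "raising the pitch by N%" means multiplying the fundamental frequency by (1 + N/100); the source does not define the percentage precisely. The 120 Hz base frequency and the function names are hypothetical, chosen only for illustration.

```python
import math


def pitch_ratio(percent_increase):
    """Frequency ratio for a pitch raised by the given percentage
    (assumed interpretation: new_f0 = old_f0 * (1 + percent / 100))."""
    return 1.0 + percent_increase / 100.0


def ratio_to_semitones(ratio):
    """Express a frequency ratio in equal-tempered semitones."""
    return 12.0 * math.log2(ratio)


BASE_F0_HZ = 120.0  # hypothetical fundamental frequency of the male speaker

for label, percent in (("FIG. 2B", 30), ("FIG. 2C", 60)):
    ratio = pitch_ratio(percent)
    print(label, ratio, BASE_F0_HZ * ratio, round(ratio_to_semitones(ratio), 2))
```

Under these assumptions a 30% increase is a shift of roughly 4.5 semitones and a 60% increase roughly 8.1 semitones, which is the form in which many pitch-shifting tools expect the parameter.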

As shown in FIG. 2D, when the voice variation unit 12 raises the tempo of the reference voice data of FIG. 2A by 150%, the data is converted into faster speech without any change in frequency.

As shown in FIG. 2E, when the voice variation unit 12 lowers the tempo of the reference voice data of FIG. 2A by 75%, the data is converted into slower speech without any change in frequency.

In the same way, because speaking rate differs with age, the voice variation unit 12 can obtain voice data for various age groups from a single piece of reference voice data by adjusting the tempo up or down.
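The effect of a tempo change on utterance length can be sketched as below. The tempo figures in the text are ambiguous: this sketch reads them as tempo ratios of 1.5 and 0.75 relative to the original, but "raised by 150%" and "lowered by 75%" could equally be read as ratios of 2.5 and 0.25, and the source does not settle which is meant. The function name and the 3-second utterance are hypothetical.

```python
def duration_after_tempo_change(duration_s, tempo_ratio):
    """Duration of an utterance after its tempo is scaled by tempo_ratio.

    tempo_ratio > 1 means faster speech (shorter duration);
    tempo_ratio < 1 means slower speech (longer duration).
    Pitch is assumed to be left unchanged by the tempo shifter.
    """
    if tempo_ratio <= 0:
        raise ValueError("tempo_ratio must be positive")
    return duration_s / tempo_ratio


utterance_s = 3.0  # hypothetical length of one reference recording
fast = duration_after_tempo_change(utterance_s, 1.5)    # 2.0 seconds
slow = duration_after_tempo_change(utterance_s, 0.75)   # 4.0 seconds
```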

The voice data training unit 13 performs a known voice data training process on the reference voice data converted by the voice variation unit 12, that is, on the large and diverse set of voice data, covering each age and gender, into which the voice variation unit 12 has converted the small amount of voice data uttered by the small number of speakers. Any known algorithm may be used for the voice data training process, and a detailed description is omitted.

The speech recognition model generation unit 14 generates a known speech recognition model DB based on the voice data trained by the voice data training unit 13. One example of such a speech recognition model is the HMM (Hidden Markov Model), which is widely used in the field of speech recognition.
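The source names the HMM only as one example of a speech recognition model and leaves the training algorithm open. As background on what such a model computes, the sketch below implements the standard forward algorithm for a toy discrete HMM, which scores how likely an observation sequence is under the model. All names and numbers here are hypothetical and unrelated to any model the patent's authors may have built.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Return P(observations | model) for a discrete HMM.

    initial[s]        : probability of starting in state s
    transition[p][s]  : probability of moving from state p to state s
    emission[s][o]    : probability of emitting symbol o in state s
    """
    if not observations:
        return 1.0  # the empty sequence has probability 1
    n_states = len(initial)

    # alpha[s] = P(observations so far, current state = s)
    alpha = [initial[s] * emission[s][observations[0]]
             for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states))
            * emission[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy two-state model with two output symbols (all values hypothetical).
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.5, 0.5], [0.1, 0.9]]

likelihood = forward_likelihood(initial, transition, emission, [0, 1])
```

In a recognizer, one such likelihood is typically computed per candidate word or phone model, and the candidate with the highest score is chosen.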

First, the reference voice data, that is, voice data uttered by a predetermined small number of speakers, is recorded (301). The speakers to be recorded are preferably about 10 men and about 10 women in each of their 20s, 30s, and 40s who speak the standard dialect; under the direction of an administrator, they take turns uttering the words and sentences typically used to build a speech recognition model DB.

Then, modulation, stretching, and the like are applied to the recorded reference voice data to convert it into voice data of a different age and/or a different gender (302). At this point, pitch modulation is performed to make the reference voice data sound higher or lower, and/or tempo modulation is performed to make it sound slower or faster.
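How a small speaker pool multiplies into a larger training set in the conversion step (302) can be sketched as a simple enumeration. The speaker pool (3 age groups, 2 genders, about 10 speakers each) follows the text; the particular pitch and tempo ratios, the speaker ID format, and the function name are assumptions for illustration, and the text does not say how many variants are generated per recording.

```python
from itertools import product

AGE_GROUPS = ("20s", "30s", "40s")
GENDERS = ("male", "female")
SPEAKERS_PER_GROUP = 10          # "about 10" in the source text

# Hypothetical variation grid; 1.0 keeps the original recording.
PITCH_RATIOS = (1.0, 1.3, 1.6)
TEMPO_RATIOS = (0.75, 1.0, 1.5)


def build_variation_plan():
    """List every (speaker, pitch_ratio, tempo_ratio) combination
    that the conversion step would produce under the grid above."""
    speakers = [
        f"{age}-{gender}-{index:02d}"
        for age, gender in product(AGE_GROUPS, GENDERS)
        for index in range(1, SPEAKERS_PER_GROUP + 1)
    ]
    return list(product(speakers, PITCH_RATIOS, TEMPO_RATIOS))


plan = build_variation_plan()
print(len(plan))  # 60 speakers x 3 pitch ratios x 3 tempo ratios = 540
```

With this grid, each of the 60 reference speakers yields 9 voice variants, so the training step receives 540 speaker variants per recorded utterance instead of 60.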

Next, a known voice data training process is performed on the converted reference voice data, that is, on the large and diverse set of voice data, covering each age and gender, into which the voice variation has converted the small amount of voice data uttered by the small number of speakers (303). Any known algorithm may be used for the voice data training process, and a detailed description is omitted.

Then, a known speech recognition model DB is generated based on the trained voice data (304). One example of such a speech recognition model is the HMM (Hidden Markov Model), which is widely used in the field of speech recognition.

As described above, the algorithm of the present invention obtains a large amount of diverse voice data from a small amount of voice data uttered by a small number of speakers and trains on it to generate a speech recognition model. When this model is installed in an intelligent robot, a mobile phone, a speech recognition server, or the like, speech input by a user can be recognized using this speech recognition model DB together with an existing pronunciation dictionary DB, an existing utterance grammar DB, and the like; for example, the words and sentences in text form corresponding to the input speech can be produced.

The method of the present invention described above can be implemented as a program and stored in computer-readable form on a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, and the like). Since those of ordinary skill in the art can readily carry out this process, it is not described in further detail.

The present invention described above is not limited by the foregoing embodiment and the accompanying drawings, because those of ordinary skill in the art to which the present invention pertains can make various substitutions, modifications, and changes without departing from its technical idea.

As described above, the present invention has the effect of generating a high-performance speech recognition model by obtaining, from a small amount of voice data uttered by a small number of speakers, a large amount of diverse voice data covering different ages, genders, and so on, and training on it.

The present invention also has the effect of generating a speech recognition model from the reference voice data alone, without voice data from infants, the elderly, and others whose voices are difficult to collect.

The present invention also has the effect of considerably reducing the time and cost of building the speech recognition model DB, because a speech recognition model can be generated without collecting, recording, and training on the voice data of thousands of people.

Claims

1. A few-speaker voice data training apparatus using voice variation, for training voice data uttered by speakers, the apparatus comprising:

a voice recording unit for recording voice data uttered by a predetermined small number of speakers (hereinafter "reference voice data");

a voice variation unit for converting the reference voice data recorded by the voice recording unit into voice data of a different age and/or a different gender by performing at least one of pitch modulation in the frequency domain and stretching in the time domain;

a voice data training unit for performing a voice data training process on the reference voice data converted by the voice variation unit; and

a speech recognition model generation unit for generating a speech recognition model database (DB) based on the voice data trained by the voice data training unit.

2. The apparatus of claim 1, wherein the voice variation unit comprises:

a pitch shifter that performs pitch modulation in the frequency domain to make the reference voice data sound higher or lower; and

a tempo shifter that performs tempo modulation in the time domain to make the reference voice data sound slower or faster.

3. The apparatus of claim 1 or 2, wherein the reference voice data includes voice data uttered by a small number of men and women of each age group.

4. A few-speaker voice data training method using voice variation, for training voice data uttered by speakers, the method comprising:

a reference voice data recording step of recording voice data uttered by a predetermined small number of speakers (hereinafter "reference voice data");

a voice data conversion step of converting the recorded reference voice data into voice data of a different age and/or a different gender by performing at least one of pitch modulation in the frequency domain and stretching in the time domain;

performing a voice data training process on the converted reference voice data; and

generating a speech recognition model database (DB) based on the trained voice data.

5. The method of claim 4, wherein the voice data conversion step performs pitch modulation in the frequency domain to make the reference voice data sound higher or lower, and/or performs tempo modulation in the time domain to make the reference voice data sound slower or faster.

6. The method of claim 4 or 5, wherein the reference voice data includes voice data uttered by a small number of men and women of each age group.

7. The method of claim 6, wherein, in the reference voice data recording step, the small number of speakers take turns uttering the words and sentences typically used to build a speech recognition model database (DB).

8. (Deleted)