KR20170140680A

KR20170140680A - Method and apparatus for generating speaker recognition model by machine learning

Info

Publication number: KR20170140680A
Application number: KR1020160073320A
Authority: KR
Inventors: 금명철; 윤재선; 이강규; 이항섭
Original assignee: 주식회사 셀바스에이아이
Priority date: 2016-06-13
Filing date: 2016-06-13
Publication date: 2017-12-21
Also published as: KR101833731B1

Abstract

The present invention relates to a method and an apparatus for generating a speaker recognition model through machine learning, the method according to an embodiment of the present invention comprising: receiving a voice of a speaker to be learned through machine learning; applying, to the voice of the speaker, a voice conversion filter for converting a voice in accordance with a characteristic of a voice recognition device and an environment in which the voice of the speaker can be received; and acquiring a speaker recognition model by machine learning of the voice of the speaker and the voice of the speaker to which the voice conversion filter has been applied. The method and apparatus for generating a speaker recognition model are capable of identifying the speaker through machine learning even if the voice of the speaker is converted in accordance with the environment and characteristic of the voice recognition device.

Description

The present invention relates to a method and apparatus for generating a speaker recognition model through machine learning and, more particularly, to a method and apparatus that generate a speaker recognition model capable of distinguishing a speaker by voice, unaffected by changes in the environment in which the voice is recognized or by the speaker's utterance state.

A sound wave is a wave that arises when a vibration changes the pressure of otherwise uniform air and propagates as a longitudinal wave. Sound waves are classified as audible sound, with frequencies from 20 Hz to 20,000 Hz; infrasound, below 20 Hz; and ultrasound, above 20,000 Hz.

Because sound propagates as a wave, it is used in many technical fields. Sound waves are a core technology in underwater exploration devices for geophysical surveying in water, positioning devices, ultrasonic diagnostic devices, and physical therapy devices. Sonogenetics, which activates or deactivates genes in neurons in the brain, and acoustic holograms, which lift or move objects, are also technologies based on sound waves.

Furthermore, two related technologies have been advancing recently: voice recognition technology, which recognizes the sound waves of a voice so that a processing terminal such as a computer, smartphone, or tablet PC can be operated by voice, and speaker recognition technology, which distinguishes speakers through a processing terminal. However, voice and speaker recognition technologies are limited in that they rely on sound waves, which are easily distorted by various conditions. Specifically, because the sound waves of a voice change with the environment in which they are received and with the speaker's state, the result produced by the recognition technology may differ from the result the user intended. For example, a voice recognition device affected by a change in the receiving environment may end up classifying the environment rather than the speaker, or may fail to recognize the speaker because the voice has changed with the speaker's speaking habits, health, or emotional state.

Accordingly, there is a need for a method that can process distorted sound waves to recognize voices and speakers.

[Related Art Document]

Voice recognition method and apparatus (Korean Laid-Open Patent Publication No. 10-2004-0061659)

An object of the present invention is to provide a method and apparatus for generating a speaker recognition model through machine learning that can recognize a speaker's voice and distinguish the speaker regardless of the environmental factors under which the voice is recognized, namely the place of utterance, the voice recognition device, and the speaker's utterance state.

Another object of the present invention is to provide a method and apparatus for generating a speaker recognition model through machine learning that can distinguish a speaker by simulating the application of various environmental factors and utterance states to a single voice and performing machine learning on the simulation results.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the problems described above, a method of generating a speaker recognition model through machine learning according to an embodiment of the present invention includes: receiving a voice of a speaker to be learned through machine learning; applying, to the speaker's voice, a voice conversion filter that converts the voice in accordance with an environment in which the speaker's voice can be received and a characteristic of a voice recognition device; and obtaining a speaker recognition model by machine learning of the speaker's voice and the speaker's voice to which the voice conversion filter has been applied.

According to another feature of the present invention, obtaining the speaker recognition model may include obtaining a first weight for the voice conversion filter and a second weight for the speaker's voice by machine learning of the speaker's voice and the speaker's voice to which the voice conversion filter has been applied.

According to yet another feature of the present invention, obtaining the first weight and the second weight may be obtaining, through machine learning, the first weight and a second weight that is higher than the first weight.

According to yet another feature of the present invention, applying the voice conversion filter may be applying, to the speaker's voice, voice conversion filters in each of which at least one of the equalizer, the reverb, the speed of the voice, and the pitch of the voice has been changed.

According to yet another feature of the present invention, obtaining the speaker recognition model may include generating a database that stores the speaker's voice to which the voice conversion filter has been applied, and learning, based on the speaker's voice stored in the database, the voice characteristics of the speaker that change in response to the voice conversion filter.

According to yet another feature of the present invention, the method may further include, when a speaker's voice is received, recognizing the speaker by applying the speaker recognition model to the speaker's voice.

To solve the problems described above, an apparatus for generating a speaker recognition model through machine learning according to an embodiment of the present invention includes a communication unit that receives a voice of a speaker to be learned through machine learning, and a processor that applies, to the speaker's voice, a voice conversion filter that converts the voice in accordance with an environment in which the speaker's voice can be received and a characteristic of a voice recognition device, wherein the processor obtains a speaker recognition model by machine learning of the speaker's voice and the speaker's voice to which the voice conversion filter has been applied.

According to another feature of the present invention, the processor may apply, to the speaker's voice, voice conversion filters in each of which at least one of the equalizer, the reverb, the speed of the voice, and the pitch of the voice has been changed.

According to yet another feature of the present invention, the apparatus may further include a storage unit that stores the speaker's voice to which the voice conversion filter has been applied, wherein the storage unit generates a database of the speaker's voice, and the processor may learn, based on the speaker's voice stored in the database, the voice characteristics of the speaker that change in response to the voice conversion filter.

According to yet another feature of the present invention, the processor may obtain a first weight for the voice conversion filter and a second weight for the speaker's voice by machine learning of the speaker's voice and the speaker's voice to which the voice conversion filter has been applied.

According to yet another feature of the present invention, the processor may generate the speaker recognition model based on the result of the machine learning and, when a speaker's voice is received, recognize the speaker by applying the speaker recognition model to the speaker's voice.

To solve the problems described above, a computer-readable recording medium according to an embodiment of the present invention stores instructions that provide a method of generating a speaker recognition model through machine learning, the instructions causing a computer to: receive a voice of a speaker to be learned through machine learning; apply, to the speaker's voice, a voice conversion filter that converts the voice in accordance with an environment in which the speaker's voice can be received and a characteristic of a voice recognition device; and obtain a speaker recognition model by machine learning of the speaker's voice and the speaker's voice to which the voice conversion filter has been applied.

Specific details of other embodiments are included in the detailed description and the drawings.

The present invention has the effect of recognizing a speaker's voice and distinguishing the speaker regardless of the environmental factors under which the voice is recognized, namely the place of utterance, the voice recognition device, and the speaker's utterance state.

The present invention has the effect of distinguishing a speaker by simulating the application of various environmental factors and utterance states to a single voice and performing machine learning on the simulation results.

The effects of the present invention are not limited to those exemplified above, and further effects are included in this specification.

FIG. 1 shows a schematic configuration of an apparatus for generating a speaker recognition model according to an embodiment of the present invention.
FIG. 2 shows a procedure for generating a speaker recognition model through machine learning according to a method of generating a speaker recognition model according to an embodiment of the present invention.
FIG. 3 illustrates, by way of example, a process of distinguishing speakers by applying weights obtained through machine learning according to an embodiment of the present invention.
FIGS. 4A and 4B show graphs of a speaker's voice as converted by voice conversion filters according to an embodiment of the present invention.
FIGS. 5A to 5C illustrate, by way of example, a process of distinguishing speakers by machine learning based on graphs of a speaker's voice to which voice conversion filters have been applied according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the invention pertains, and the present invention is defined only by the scope of the claims.

The shapes, sizes, ratios, angles, numbers, and the like shown in the drawings for describing the embodiments of the present invention are illustrative, and the present invention is not limited to what is shown. In describing the present invention, detailed descriptions of related known technologies are omitted where they would unnecessarily obscure the gist of the invention. Where 'includes', 'has', 'consists of', and the like are used in this specification, other parts may be added unless 'only' is used. A component expressed in the singular includes the plural unless explicitly stated otherwise.

In interpreting a component, it is construed as including a margin of error even where there is no separate explicit statement.

Although 'first', 'second', and the like are used to describe various components, the components are not limited by these terms. The terms are used only to distinguish one component from another. Therefore, a first component mentioned below may be a second component within the technical idea of the present invention.

Unless otherwise specified, like reference numerals refer to like components throughout the specification.

The features of the various embodiments of the present invention may be partially or wholly combined with one another and, as those skilled in the art will fully appreciate, may be technically interlocked and operated in various ways. The embodiments may be practiced independently of one another or together in association.

FIG. 1 shows a schematic configuration of an apparatus for generating a speaker recognition model according to an embodiment of the present invention.

Referring to FIG. 1, the speaker recognition model generation apparatus 100 includes a communication unit 110, a processor 120, and a storage unit 130.

The communication unit 110 may receive, from a voice recognition device, the voice of a speaker to be learned through machine learning. The voice recognition device may be, but is not limited to, a microphone, or a tablet PC, smartphone, laptop, or desktop that has a built-in microphone or is connected to one. The speaker's voice received by the communication unit 110 may be, but is not limited to, audio with an extension such as m4a, acc, mkv, or mp3.

So that a voice conversion filter can be applied to the speaker's voice, the communication unit 110 may transmit the received voice to the processor 120 and may receive from the processor 120 the speaker's voice to which the voice conversion filter has been applied.

The processor 120 applies a voice conversion filter to the speaker's voice. The processor 120 also obtains a speaker recognition model by machine learning of the speaker's voice and the speaker's voice to which the voice conversion filter has been applied.

The storage unit 130 may store data needed for machine learning, such as the speaker's voice, the voice conversion filter, and the speaker's voice to which the voice conversion filter has been applied. Furthermore, the storage unit 130 generates a database of the speaker's voice. The storage unit 130 may also store the speaker recognition model generated by the processor 120.

Accordingly, the communication unit 110 may transmit to the storage unit 130 the data to be stored there: the voice of the speaker to be learned through machine learning, the voice conversion filter, and the speaker's voice to which the voice conversion filter has been applied. The communication unit 110 may also receive from the storage unit 130 the speaker's voice, the voice conversion filter, and the filtered voice that are stored there.

Accordingly, the speaker recognition model generation apparatus 100 receives, through the communication unit 110, the voice of a speaker to be learned through machine learning and, through the processor 120, applies various voice conversion filters to the voice to simulate it in various ways. The apparatus 100 also generates a database of the speaker's voice through the storage unit 130 and, through the processor 120, generates a speaker recognition model by machine learning of the speaker's voice and the speaker's voice to which the voice conversion filters have been applied. Therefore, based on the speaker recognition model, the apparatus 100 can accurately recognize the speaker's voice without being affected by the environment in which the voice was received or by the characteristics of the voice recognition device.

Although the speaker recognition model generation apparatus 100 and its subcomponents are each shown as a single module in FIG. 1, they are not limited thereto; one module may be split into two or more, or two or more modules may be integrated into one.

Hereinafter, FIG. 2 is referred to together with FIG. 1 for a more detailed description of the method of generating a speaker recognition model through machine learning in the speaker recognition model generation apparatus 100.

FIG. 2 shows a procedure for generating a speaker recognition model through machine learning according to a method of generating a speaker recognition model according to an embodiment of the present invention. For convenience, the description refers to the components and reference numerals of FIG. 1.

The communication unit 110 receives the voice of a speaker to be learned through machine learning (S210).

Specifically, the communication unit 110 receives the voice of one speaker to be learned through machine learning. Here, machine learning means that a processing device learns by itself from externally supplied data on the basis of a learning model; for example, the speaker recognition model generation apparatus 100 learns by itself how to distinguish speakers based on their voices. Although the embodiment is described with the communication unit 110 receiving the voice of a single speaker, in some embodiments it may receive the voices of a plurality of speakers, or a plurality of voices for each of a plurality of speakers.

Next, the processor 120 applies, to the speaker's voice, a voice conversion filter that converts the voice in accordance with an environment in which the speaker's voice can be received and a characteristic of the voice recognition device (S220).

The processor 120 applies a voice conversion filter to the speaker's voice received by the communication unit 110. Here, a voice conversion filter is a filter that can change at least one of the equalizer, the reverb, the speed of the voice, and the pitch of the voice. That is, through the voice conversion filter the processor 120 can perform equalizer conversion, reverb conversion, voice speed conversion, and voice pitch conversion. Equalizer conversion changes the overall frequency of the speaker's voice. Reverb conversion changes the thickness and depth of the speaker's voice. Voice speed conversion changes how fast the speaker's voice is output. Voice pitch conversion changes the tone of the speaker's voice.
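The conversions described above can be approximated with very simple signal operations. The following is a minimal sketch, not the filters used in the patent: the function names, the parameters (`alpha`, `high_gain`, `delay`, `decay`, `factor`), and the choice of a two-band equalizer, a feedback comb reverb, and linear-interpolation resampling are all assumptions made for illustration. Resampling changes speed and pitch together; changing them independently requires more elaborate methods that are not shown here.

```python
def apply_equalizer(samples, alpha, high_gain):
    """Crude two-band equalizer (illustrative assumption).

    A one-pole low-pass splits the signal into a slow band and a fast band;
    the fast band is rescaled by high_gain before the bands are recombined.
    """
    out, low = [], 0.0
    for x in samples:
        low += alpha * (x - low)   # one-pole low-pass tracks the slow part
        high = x - low             # the residual is the fast part
        out.append(low + high_gain * high)
    return out


def apply_reverb(samples, delay, decay):
    """Feedback comb filter: add a decayed copy of the output from `delay` samples earlier."""
    out = []
    for i, x in enumerate(samples):
        echo = decay * out[i - delay] if i >= delay else 0.0
        out.append(x + echo)
    return out


def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 shortens the signal (faster playback)."""
    n_out = int(len(samples) / factor)
    out = []
    for j in range(n_out):
        pos = j * factor
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append((1 - frac) * samples[i] + frac * nxt)
    return out


# One short, made-up signal run through each conversion.
voice = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
brighter = apply_equalizer(voice, alpha=0.5, high_gain=2.0)
roomier = apply_reverb(voice, delay=2, decay=0.5)
faster = change_speed(voice, factor=2.0)
```

Each function takes and returns a plain list of samples, so the conversions can be chained to simulate a combination of conditions from a single recording.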

Accordingly, the processor 120 may apply, to the speaker's voice, voice conversion filters in each of which at least one of the equalizer, the reverb, the speed of the voice, and the pitch of the voice has been changed. By applying voice conversion filters, the voice of a single speaker can be used to simulate how that voice would be converted under a plurality of different environments and voice recognition device characteristics, without separately receiving the voice under each of them. Graphs of a speaker's voice with voice conversion filters applied are described later with reference to FIG. 4.

Additionally, the storage unit 130 may generate a database that stores the speaker's voice to which the voice conversion filter has been applied.

Specifically, the storage unit 130 may generate a database that stores the speaker's voice in which at least one of the equalizer, the reverb, the speed of the voice, and the pitch of the voice has been changed. The storage unit 130 may also generate the database so that it includes the speaker's voice to which no voice conversion filter has been applied.
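As a rough illustration of such a database, the sketch below stores the unfiltered voice alongside one converted copy per filter, each labeled with the speaker and the filter that produced it. The record layout, the function name `build_voice_database`, and the two stand-in filters are hypothetical; the patent does not specify a storage format.

```python
def build_voice_database(speaker_id, voice, filters):
    """Return one record for the original voice plus one record per filter.

    `filters` maps a filter name to a function that takes and returns a list of samples.
    """
    records = [{"speaker": speaker_id, "filter": None, "samples": list(voice)}]
    for name, convert in filters.items():
        records.append({"speaker": speaker_id, "filter": name, "samples": convert(voice)})
    return records


# Hypothetical stand-in filters, kept trivial so the example is self-contained.
filters = {
    # Keep every second sample: a crude doubling of playback speed.
    "double_speed": lambda v: v[::2],
    # Add half of the previous sample: a crude one-tap echo.
    "echo": lambda v: [x + (0.5 * v[i - 1] if i else 0.0) for i, x in enumerate(v)],
}

db = build_voice_database("speaker_A", [0.0, 1.0, -1.0], filters)
```

Keeping the filter name with each record makes it possible to learn how the voice changes in response to each filter, while the shared speaker label ties every variant back to the same person.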

Next, the processor 120 obtains a speaker recognition model by machine learning of the speaker's voice and the speaker's voice to which the voice conversion filter has been applied (S230).

In some embodiments, the processor 120 performs machine learning to obtain, based on the speaker's voice, a first weight for the voice conversion filter and a second weight for the speaker's voice that is higher than the first weight. The first weight is obtained so that the speaker's voice to which the voice conversion filter has been applied, that is, the converted voice, is represented at a smaller scale, allowing the unconverted voice to be recognized. The second weight is obtained so that the speaker's voice is represented at a larger scale than the voice conversion filter, allowing the speaker's voice to be recognized. Here, the processor 120 applies the voice conversion filter to the speaker's voice in order to represent the speaker's voice as converted under various environments and voice recognition device characteristics.

For example, the processor 120 may obtain a first weight of 0.1 for the voice conversion filter and a second weight of 0.9 for the speaker's voice, that is, a second weight higher than the first. By obtaining through machine learning a low weight from the voice to which the voice conversion filter has been applied (the converted voice) and a high weight from the unconverted voice, the processor 120 can recognize the speaker even when it receives a voice converted by the environment and the characteristics of the voice recognition device. The process of distinguishing the speaker's voice by applying the first and second weights obtained through machine learning is described with reference to FIG. 3. The processor 120 thus generates a speaker recognition model based on the results of the machine learning. When the communication unit 110 receives a speaker's voice, the processor 120 applies the speaker recognition model to the voice and recognizes the speaker even if the voice has been converted by the environment and the characteristics of the voice recognition device.

Accordingly, by generating the speaker recognition model through machine learning on the speaker's voice and its various converted versions, the speaker recognition model generation apparatus 100 can accurately recognize the speaker even when the voice has been converted by the environment or the characteristics of the device.
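The patent does not name a learning algorithm, so the sketch below substitutes a nearest-centroid classifier purely to illustrate the idea of training on original and converted voices together and then recognizing a new, converted voice. The feature vectors are made-up two-dimensional values, and `train` and `recognize` are hypothetical names; a real system would extract acoustic features from the audio and would likely use a far richer model.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(enrollment):
    """Build the model: one centroid per speaker.

    `enrollment` maps a speaker id to feature vectors taken from the
    original voice and from its filtered variants.
    """
    return {speaker: centroid(vectors) for speaker, vectors in enrollment.items()}


def recognize(model, features):
    """Return the speaker whose centroid is closest to the given features."""
    return min(model, key=lambda speaker: math.dist(model[speaker], features))


# Made-up features: the first vector per speaker stands for the original voice,
# the others for the same voice after different voice conversion filters.
enrollment = {
    "A": [[1.0, 0.0], [0.8, 0.2], [1.2, 0.1]],
    "B": [[0.0, 1.0], [0.1, 0.9], [0.2, 1.2]],
}
model = train(enrollment)
print(recognize(model, [0.9, 0.3]))  # a converted voice that lies near speaker A's variants
```

Because the converted variants are included at training time, a query that has drifted away from the original recording can still fall closest to the correct speaker.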

도 3은 본 발명의 일 실시예에 따라 머신 러닝을 통해 획득된 가중치를 적용하여 화자를 구분하는 과정을 설명하기 위해 예시적으로 도시한 것이다.FIG. 3 is an exemplary diagram illustrating a process of classifying speakers by applying weights obtained through machine learning according to an embodiment of the present invention.

Referring to FIG. 3, the voice conversion filter graph 310 includes a voice conversion filter graph 311 applied to the speaker's voice and a weighted voice conversion filter graph 312 to which a weight obtained through machine learning has been applied.

Referring to FIG. 3, the voice conversion filter graph 311 applied to the speaker's voice is a graph of the speaker's voice to which the voice conversion filter has been applied. It shows how the voice features of the speaker, with the voice conversion filter applied, vary with frequency. The weighted voice conversion filter graph 312 is the graph 311 with a weight applied. For example, when a weight of 0.1 obtained through machine learning is applied to the graph 311, the magnitude of the graph 311 at each frequency is multiplied by 0.1.

Referring to FIG. 3, the speaker's voice graph 320 includes an original voice graph 321 of the speaker and a weighted voice graph 322 to which a weight obtained through machine learning has been applied.

Referring to FIG. 3, the original voice graph 321 of the speaker is a graph of the speaker's voice received for machine learning. It shows how the voice features of the speaker vary with frequency. The weighted voice graph 322 is the original voice graph 321 with a weight applied. For example, when a weight of 0.9 is applied to the original voice graph 321, the magnitude of the graph 321 at each frequency is multiplied by 0.9.

Referring to FIG. 3, the speaker recognition graph 330 is a graph of the speaker's voice recognized as the sum of the weighted voice conversion filter graph 312 and the weighted voice graph 322. Because the weight applied to the weighted voice graph 322 is larger than the weight applied to the weighted voice conversion filter graph 312, the speaker recognition graph 330 resembles the original voice graph 321 of the speaker. In other words, the voice conversion caused by the environment in which the speaker's voice is received or by the characteristics of the voice recognition device is attenuated and the original voice features of the speaker are reinforced, so that the speaker recognition model generation apparatus 100 can recognize the speaker's voice while being less affected by the voice conversion.
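The weighted sum behind graph 330 can be written as a short sketch. The weights 0.9 and 0.1 come from the example in the text; the four magnitude values per graph are made-up numbers, and the function names are hypothetical.

```python
def weighted_sum(original, converted, w_original=0.9, w_converted=0.1):
    """Combine two per-frequency magnitude curves with fixed weights."""
    return [w_original * o + w_converted * c for o, c in zip(original, converted)]

def distance(a, b):
    """Sum of absolute per-frequency differences between two curves."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Assumed magnitudes at four frequencies (illustrative values only).
original_321 = [1.0, 0.8, 0.5, 0.2]
converted_311 = [0.4, 0.8, 1.0, 0.6]

recognition_330 = weighted_sum(original_321, converted_311)

print(distance(recognition_330, original_321))
print(distance(recognition_330, converted_311))
```

Because the original voice carries the larger weight, the combined curve stays much closer to graph 321 than to graph 311, which matches the observation that graph 330 resembles the original voice graph.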

Accordingly, through machine learning, the speaker recognition model generation apparatus 100 applies a low weight to the speaker's voice converted according to the environment and the voice recognition device and a high weight to the speaker's voice, so that the speaker can be distinguished more accurately even if the voice is converted according to the environment or the characteristics of the device.

FIGS. 4A and 4B show graphs of a speaker's voice converted by voice conversion filters according to an embodiment of the present invention.

Referring to FIG. 4A, the equalizer filter 410, which is a voice conversion filter, adjusts the dB level of the voice contained in specific frequency bands of the speaker's voice 420. The equalizer filter 410 can reduce the loudness of the voice by adjusting the level in the negative (-) direction and increase it by adjusting the level in the positive (+) direction. For example, the equalizer filter 410 may set the 55 Hz, 77 Hz, and 110 Hz bands to 0 dB (411). Because these bands of the speaker's voice 420 are set to 0 dB (411), the loudness of the voice in the 55 Hz, 77 Hz, and 110 Hz bands does not change. The equalizer filter 410 may also set the 220 Hz, 311 Hz, 440 Hz, and 622 Hz bands to - dB (412), so that the loudness of the voice in the 220 Hz, 311 Hz, 440 Hz, and 622 Hz bands is reduced. In addition, the equalizer filter 410 may set the 14 kHz and 20 kHz bands to + dB (413), thereby increasing the loudness of the voice in those bands.

Applying the equalizer filter 410 to the speaker's voice 420 can generate a first voice 421 of the speaker and a second voice 422 of the speaker, each with the voice conversion filter applied. For example, when - dB (412) is set over the entire frequency band of the speaker's voice 420, the first voice 421, which is quieter than the speaker's voice 420, can be generated. When + dB (413) is set over the entire frequency band of the speaker's voice 420, the second voice 422, which is louder than the speaker's voice 420, can be generated.

Accordingly, the speaker recognition model generation apparatus simulates variously converted voices from the speaker's voice 420 by varying the values of the equalizer filter 410, and performs machine learning based on the simulated voices.
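A per-band equalizer of this kind can be sketched as follows. The text gives only the sign of each band setting, so the concrete values of -6 dB and +6 dB are assumptions, as are the flat input spectrum and the function names. The conversion from dB to an amplitude factor uses the standard relation gain = 10^(dB/20).

```python
def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude factor."""
    return 10 ** (db / 20)

def equalize(spectrum, band_db):
    """Scale each band's amplitude; bands missing from band_db stay unchanged."""
    return {hz: amp * db_to_gain(band_db.get(hz, 0.0)) for hz, amp in spectrum.items()}

# Assumed flat spectrum of the speaker's voice 420 (band in Hz -> amplitude).
spectrum_420 = {55: 1.0, 110: 1.0, 220: 1.0, 440: 1.0, 14000: 1.0}

# Band settings in the spirit of FIG. 4A; the magnitudes 6.0 are assumed.
band_db = {55: 0.0, 110: 0.0, 220: -6.0, 440: -6.0, 14000: 6.0}
equalized = equalize(spectrum_420, band_db)

# Whole-band settings, as in the first voice 421 and second voice 422.
first_voice_421 = equalize(spectrum_420, {hz: -6.0 for hz in spectrum_420})
second_voice_422 = equalize(spectrum_420, {hz: 6.0 for hz in spectrum_420})

print(equalized)
```

Varying the values in `band_db` yields as many simulated variants of the same utterance as needed for training.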

Referring to FIG. 4B, the voice speed and pitch filter 430, which is a voice conversion filter, adjusts the speed of the speaker's voice 440. By adjusting the speed of the voice, the voice speed and pitch filter 430 also adjusts its pitch at the same time. Specifically, the voice speed and pitch filter 430 can change the speed and pitch of the voice by adjusting the speed multiplier 431 of the speaker's voice 440. For example, when the speed multiplier 431 of the speaker's voice 440 is set to 1.103, the speaker's voice 440 becomes 1.103 times faster, and its pitch changes according to the increased speed. The voice speed and pitch filter 430 can also change the speed and pitch of the voice by adjusting the length of the speaker's voice 440. For example, when the current length of the speaker's voice 440 is 1.980 sec and the changed length 432 is set to 1.795 sec, the voice becomes shorter, so its speed increases and its pitch changes as well.
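The relation between the speed multiplier and the changed length can be checked with a few lines of arithmetic. This sketch assumes a simple playback-rate change, in which the length is divided by the multiplier and every frequency is multiplied by it; the function names and the 200 Hz example pitch are hypothetical.

```python
def changed_length(current_sec, speed_multiplier):
    """Length of the voice after speeding it up by the given multiplier."""
    return current_sec / speed_multiplier

def shifted_frequency(freq_hz, speed_multiplier):
    """Frequency after a playback-rate change (assumed proportional shift)."""
    return freq_hz * speed_multiplier

new_length = changed_length(1.980, 1.103)
print(round(new_length, 3))            # 1.795
print(shifted_frequency(200.0, 1.103))
```

Under this assumption the two example settings in the text describe the same conversion: a multiplier of 1.103 applied to a 1.980 sec voice gives a length of about 1.795 sec.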

Applying the voice speed and pitch filter 430 to the speaker's voice 440 can generate a third voice 441 of the speaker and a fourth voice 442 of the speaker, each with the voice conversion filter applied. For example, applying a speed multiplier 431 greater than 1 to the speaker's voice 440 can generate the third voice 441, which is faster than the speaker's voice 440. The generated third voice 441 is output faster than the speaker's voice 440 and has higher frequencies, so its pitch is also higher than that of the speaker's voice 440. Applying a speed multiplier 431 less than 1 to the speaker's voice 440 can generate the fourth voice 442, which is slower than the speaker's voice 440. The generated fourth voice 442 is output more slowly than the speaker's voice 440 and has lower frequencies, so its pitch is also lower than that of the speaker's voice 440. Accordingly, the speaker recognition model generation apparatus applies different values of the voice speed and pitch filter 430 to the speaker's voice 440 to simulate voices with various speeds and pitches, and performs machine learning based on the simulated voices.
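One way to obtain such faster and slower variants is to resample the waveform, sketched below with linear interpolation. This is an assumed realization, not a method stated in the description; the 200 Hz test tone, the 8000 Hz sample rate, the multipliers 1.25 and 0.8, and the function names are all illustrative.

```python
import math

def change_speed(samples, speed_multiplier):
    """Resample by linear interpolation; at the original sample rate the
    result plays faster (multiplier > 1) or slower (multiplier < 1)."""
    n_out = max(1, int(round(len(samples) / speed_multiplier)))
    out = []
    for i in range(n_out):
        pos = i * speed_multiplier
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

def estimate_frequency(samples, sample_rate):
    """Rough pitch estimate: upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings / (len(samples) / sample_rate)

sample_rate = 8000
# Assumed stand-in for the speaker's voice 440: one second of a 200 Hz tone.
voice_440 = [
    math.sin(2 * math.pi * 200 * n / sample_rate + 0.3) for n in range(sample_rate)
]

third_voice_441 = change_speed(voice_440, 1.25)   # faster, higher pitch
fourth_voice_442 = change_speed(voice_440, 0.8)   # slower, lower pitch

for name, voice in [("440", voice_440), ("441", third_voice_441), ("442", fourth_voice_442)]:
    print(name, len(voice), estimate_frequency(voice, sample_rate))
```

The faster variant is shorter and its estimated pitch rises, while the slower variant is longer and its pitch drops, mirroring the behavior described for the third voice 441 and the fourth voice 442.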

Accordingly, the speaker recognition model generation apparatus applies the equalizer filter and the voice speed and pitch filter to the speaker's voice to simulate the various environments in which the speaker's voice may be received and the characteristics of various voice recognition devices, and can thereby perform machine learning more precisely.

FIGS. 5A to 5C are exemplary illustrations for describing a process of distinguishing a speaker by machine learning based on graphs of the speaker's voice to which voice conversion filters have been applied, according to an embodiment of the present invention.

Referring to FIG. 5A, the first voice-conversion-filter-applied voice graph 510 includes a first frequency change graph 511, which shows the frequency over time of the speaker's voice to which a voice conversion filter has been applied, and a first acoustic feature 512 of the speaker's voice appearing in the first frequency change graph 511. The graph 510 also includes a first voice trend line 513 of the speaker, which appears in the first frequency change graph 511 and results from the voice conversion filter. Here, the first voice trend line 513 results from the voice conversion filter that models the environment in which the speaker's voice is received and the characteristics of the voice recognition device.

Referring to FIG. 5B, the second voice-conversion-filter-applied voice graph 520 includes a second frequency change graph 521, which shows the frequency over time of the speaker's voice to which a voice conversion filter has been applied, and a second acoustic feature 522 of the speaker's voice appearing in the second frequency change graph 521. The graph 520 also includes a second voice trend line 523 of the speaker, which appears in the second frequency change graph 521 and results from the voice conversion filter. Here, the second voice trend line 523 results from the voice conversion filter that models the environment in which the speaker's voice is received and the characteristics of the voice recognition device.

Referring to FIGS. 5A and 5B, the same speaker's voice was used for the first voice-conversion-filter-applied voice graph 510 and the second voice-conversion-filter-applied voice graph 520. Thus, even for a voice uttered by the same speaker and containing the same words, different graphs can result depending on the voice conversion filter. That is, even for the same voice, the acoustic features can be computed differently depending on the voice conversion filter.

The speaker recognition model generation apparatus therefore applies a high weight to the first acoustic feature 512 and the second acoustic feature 522, so that even if the speaker's voice is converted by various voice conversion filters, machine learning can be performed based on the speaker's voice features regardless of which filter is applied.

That is, the speaker recognition model generation apparatus assigns a low weight to the speaker's voice trend line resulting from the voice conversion filter and a high weight to the acoustic features of the speaker's voice, and can thereby perform machine learning while minimizing the change in the speaker's voice caused by the voice conversion filter.
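The effect of down-weighting the trend line can be illustrated with a rough sketch. The description does not say how the trend line or the acoustic features are computed, so this example assumes that the trend is a moving average of the frequency-over-time curve and that the acoustic feature is the remaining deviation from it. The weights 0.1 and 0.9 are borrowed from the earlier example, and the curves and function names are made up.

```python
def moving_average(curve, window=3):
    """Assumed trend-line estimate: centered moving average, truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(curve)):
        lo, hi = max(0, i - half), min(len(curve), i + half + 1)
        out.append(sum(curve[lo:hi]) / (hi - lo))
    return out

def reweight(curve, w_trend=0.1, w_feature=0.9):
    """Give the trend a low weight and the deviation from it a high weight."""
    trend = moving_average(curve)
    return [w_trend * t + w_feature * (c - t) for c, t in zip(curve, trend)]

def distance(a, b):
    """Sum of absolute differences between two curves."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Same speaker-specific detail, two different filter-induced trends (assumed data).
detail = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
curve_511 = [0.5 * i + d for i, d in enumerate(detail)]
curve_521 = [-0.3 * i + d for i, d in enumerate(detail)]

before = distance(curve_511, curve_521)
after = distance(reweight(curve_511), reweight(curve_521))
print(before, after)
```

After reweighting, the two curves derived from the same utterance lie much closer together, which is the sense in which the filter-induced change is minimized.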

Accordingly, by performing machine learning, the speaker recognition model generation apparatus makes it possible to recognize the speaker from the characteristics of the speaker's voice.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

Although embodiments of the present invention have been described in more detail with reference to the accompanying drawings, the present invention is not necessarily limited to these embodiments and may be modified in various ways without departing from its technical idea. The embodiments disclosed herein are therefore intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The embodiments described above should accordingly be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within the equivalent scope should be construed as falling within the scope of rights of the present invention.

100: Speaker recognition model generation apparatus
110: Communication unit
120: Processor
130: Storage unit
310: Voice conversion filter graph
311: Voice conversion filter graph applied to the speaker's voice
312: Weighted voice conversion filter graph
320: Speaker's voice graph
321: Original voice graph of the speaker
322: Weighted voice graph
330: Speaker recognition graph
410: Equalizer filter
411: 0 dB
412: - dB
413: + dB
420: Speaker's voice
421: First voice
422: Second voice
430: Voice speed and pitch filter
431: Speed multiplier
432: Changed length
440: Speaker's voice
441: Third voice
442: Fourth voice
510: First voice-conversion-filter-applied voice graph
511: First frequency change graph
512: First acoustic feature
513: First voice trend line of the speaker
520: Second voice-conversion-filter-applied voice graph
521: Second frequency change graph
522: Second acoustic feature
523: Second voice trend line of the speaker

Claims

A method for generating a speaker recognition model through machine learning, the method comprising:
receiving a voice of a speaker to be learned through machine learning;
applying, to the voice of the speaker, a voice conversion filter that converts a voice according to an environment in which the voice of the speaker can be received and characteristics of a voice recognition device; and
acquiring a speaker recognition model by machine learning of the voice of the speaker and the voice of the speaker to which the voice conversion filter has been applied.

The method according to claim 1,
wherein the acquiring of the speaker recognition model comprises:
acquiring a first weight of the voice conversion filter and a second weight of the voice of the speaker by machine learning of the voice of the speaker and the voice of the speaker to which the voice conversion filter has been applied.

The method according to claim 2,
wherein the acquiring of the first weight and the second weight comprises:
acquiring, through machine learning, the first weight and the second weight that is higher than the first weight.

The method according to claim 1,
wherein the applying of the voice conversion filter comprises:
applying, to the voice of the speaker, voice conversion filters that each change at least one of an equalizer, a reverb, a speed of the voice, and a pitch of the voice.

The method according to claim 1,
wherein the acquiring of the speaker recognition model comprises:
generating a database storing the voice of the speaker to which the voice conversion filter has been applied; and
learning voice features of the speaker corresponding to the voice conversion filter based on the voice of the speaker stored in the database.

The method according to claim 1,
further comprising, when a voice of the speaker is received,
recognizing the speaker by applying the speaker recognition model to the voice of the speaker.

An apparatus for generating a speaker recognition model through machine learning, the apparatus comprising:
a communication unit configured to receive a voice of a speaker to be learned through machine learning; and
a processor configured to apply, to the voice of the speaker, a voice conversion filter that converts a voice according to an environment in which the voice of the speaker can be received and characteristics of a voice recognition device,
wherein the processor
acquires a speaker recognition model by machine learning of the voice of the speaker and the voice of the speaker to which the voice conversion filter has been applied.

The apparatus according to claim 7,
wherein the processor
applies, to the voice of the speaker, voice conversion filters that each change at least one of an equalizer, a reverb, a speed of the voice, and a pitch of the voice.

The apparatus according to claim 7,
further comprising a storage unit configured to store the voice of the speaker to which the voice conversion filter has been applied,
wherein the storage unit
generates a database of the voice of the speaker, and
wherein the processor
learns voice features of the speaker corresponding to the voice conversion filter based on the voice of the speaker stored in the database.

The apparatus according to claim 7,
wherein the processor
acquires a first weight of the voice conversion filter and a second weight of the voice of the speaker by machine learning of the voice of the speaker and the voice of the speaker to which the voice conversion filter has been applied.

The apparatus according to claim 7,
wherein the processor,
when a voice of the speaker is received,
recognizes the speaker by applying the speaker recognition model to the voice of the speaker.

A voice of a speaker to be learned through machine learning is received,
a voice conversion filter that converts a voice according to an environment in which the voice of the speaker can be received and characteristics of a voice recognition device is applied to the voice of the speaker, and
a speaker recognition model is acquired by machine learning of the voice of the speaker and the voice of the speaker to which the voice conversion filter has been applied.