KR20190099988A

KR20190099988A - Device for voice recognition using end point detection and method thereof

Info

Publication number: KR20190099988A
Application number: KR1020180038604A
Authority: KR
Inventors: 윤재선; 김경선; 서은주; 한명수; 이강규
Original assignee: 주식회사 셀바스에이아이
Priority date: 2018-02-19
Filing date: 2018-04-03
Publication date: 2019-08-28
Also published as: KR102101373B1

Abstract

The present invention relates to a speech recognition device using a reference speaker model and a speech recognition method using the same. According to one embodiment of the present invention, the speech recognition method using a reference speaker model comprises the steps of: receiving first trigger voice data from a first speaker; generating a reference speaker model from the first trigger voice data by using a speaker recognition algorithm; receiving command voice data from the first speaker; generating a first command voice segment having a start point and an end point from the command voice data by using a voice region detection algorithm; generating a second command voice segment by correcting the start point and/or the end point of the first command voice segment based on the reference speaker model; and performing speech recognition on the second command voice segment or providing a speech recognition result. Because the first command voice segment is corrected based on the first trigger voice data, the voice section can be detected accurately, which improves the performance of the speech recognition device.

Description

Speech recognition device using reference speaker model and speech recognition method using same {DEVICE FOR VOICE RECOGNITION USING END POINT DETECTION AND METHOD THEREOF}

The present invention relates to a speech recognition apparatus using a reference speaker model and a speech recognition method using the same, and more particularly, to a speech recognition apparatus and method that generate a reference speaker model from trigger voice data to improve speech recognition performance in noisy environments.

Voice is the most common and convenient means of conveying information that humans use. Spoken language plays an important role not only as a means of communication between people but also as a means of operating machines and devices using the human voice. Speech recognition technology has recently been advancing thanks to improvements in computer performance, the development of various media, and advances in signal and information processing technology.

Speech recognition technology is a technology by which a computer analyzes or understands human speech. It relies on the fact that human speech has specific frequencies that depend on the shape of the mouth and the position of the tongue during pronunciation: the uttered speech is converted into an electrical signal, and the frequency characteristics of the voice signal are then extracted to recognize the pronunciation.

When a voice signal is input, only the portion actually uttered by the speaker must be detected, and this voice detection step has a very large influence on speech recognition performance. Real speech recognition environments are very poor because of ambient noise and the like, so in most environments where speech recognition is performed the detected section often includes noise.

The problem to be solved by the present invention is to detect the exact start point and end point of the speech uttered by a speaker and to improve the speech recognition rate.

The problems of the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

A speech recognition method using a reference speaker model according to an embodiment of the present invention for solving the above problem includes: receiving first trigger voice data from a first speaker; generating a reference speaker model from the first trigger voice data using a speaker recognition algorithm; receiving command voice data from the first speaker; generating, from the command voice data, a first command voice segment having a start point and an end point using a voice region detection algorithm; generating a second command voice segment by correcting the start point and/or the end point of the first command voice segment based on the reference speaker model; and performing speech recognition on the second command voice segment or providing a speech recognition result.

A speech recognition apparatus using a reference speaker model according to an embodiment of the present invention for solving the above problem includes: an input unit that receives trigger voice data and command voice data; a speaker model generator that analyzes the trigger voice data to generate a reference speaker model; a voice segment generator that analyzes the command voice data to generate a first command voice segment having a start point and an end point; and a voice segment corrector that generates a second command voice segment by correcting the start point and/or the end point of the first command voice segment based on the reference speaker model.

Specific details of other embodiments are included in the detailed description and drawings.

The present invention can detect an accurate voice section and improve the speech recognition rate.

The effects of the present invention are not limited to those exemplified above, and further various effects are included in this specification.

FIG. 1A is a block diagram illustrating the configuration of a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention.
FIG. 1B is a block diagram schematically illustrating the configuration of a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention, as driven by a processor.
FIG. 2 is a flowchart illustrating a speech recognition method using a reference speaker model according to an embodiment of the present invention.
FIGS. 3A to 3G are exemplary diagrams for describing a speech recognition method using a reference speaker model according to an embodiment of the present invention.
FIG. 4 is a block diagram schematically illustrating the memory unit of a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention.
FIGS. 5A and 5B are exemplary diagrams for describing a process of storing a second reference speaker model in the memory unit of a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention.
FIGS. 6A and 6B are exemplary diagrams for describing a process of storing a second reference speaker model in the memory unit of a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention.

The techniques described below may be used over wired connections or with various multiple access schemes such as code division multiple access (CDMA), frequency division multiple access (FDMA), time division multiple access (TDMA), orthogonal frequency division multiple access (OFDMA), and single carrier-frequency division multiple access (SC-FDMA). CDMA may be implemented with a radio technology such as Universal Terrestrial Radio Access (UTRA) or CDMA2000.

Advantages and features of the present invention, and methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention; the present invention is defined only by the scope of the claims.

Although terms such as first and second are used to describe various components, these components are of course not limited by these terms. These terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may of course be a second component within the technical spirit of the present invention.

Like reference numerals refer to like elements throughout the specification.

The features of the various embodiments of the present invention may be partially or wholly coupled or combined with one another and, as those skilled in the art will fully understand, may be technically interlocked and operated in various ways. The embodiments may be implemented independently of one another or together in association.

Hereinafter, various embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1A is a block diagram illustrating the configuration of a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention.

Hereinafter, the speech recognition apparatus 100 is a concept that includes mobile phones, tablets, PCs, wearable devices, laptops, stick PCs, PMPs, MP3 players, and the like, and covers any device having a speech recognition function, including kinds not mentioned above.

The speech recognition apparatus 100 includes an input unit 110, a processor 120, a memory unit 130, and an output unit 140.

The input unit 110 receives voice data uttered by an arbitrary speaker. The input unit 110 may be, for example, a microphone included in the speech recognition apparatus 100, but the present invention is not limited thereto.

The input unit 110 may further include a module such as a filter and may be configured to remove noise and the like contained in the input voice data.

The processor 120 processes various data for performing the speech recognition method that uses the speech recognition apparatus 100 using a reference speaker model according to an embodiment of the present invention.

Specifically, the processor 120 uses a speaker recognition algorithm to analyze the voice data received from the input unit 110 and generate a speaker model. The processor 120 also uses a voice region detection algorithm to analyze the voice data received from the input unit 110 and generate a voice segment.

In addition, the processor 120 compares speaker models, generates a new voice segment by correcting the start point (SP) and/or the end point (EP) of the voice segment, and performs speech recognition on the new voice segment or provides a speech recognition result.

Meanwhile, the speech recognition apparatus 100 may correspond to at least one processor 120 or may include at least one processor 120. Accordingly, the speech recognition apparatus 100 may be operated in a form included in a hardware device such as a microprocessor or a general-purpose computer system, or may be operated as a separate device.

The memory unit 130 may store data for the operation of the speech recognition apparatus 100 and may temporarily store input and output data. For example, the memory unit 130 may store software code executable by at least one of the input unit 110, the processor 120, and the output unit 140. In addition, the memory unit 130 may store all kinds of software code needed for the speech recognition apparatus 100 using a reference speaker model according to an embodiment of the present invention to operate.

For example, the memory unit 130 may store operations for a computer program for improving speech recognition performance, the program being stored on a computer-readable medium, executable by one or more processors 120, and including instructions that cause the one or more processors 120 to perform the following operations. The operations may include: receiving trigger voice data and command voice data; generating a reference speaker model from the trigger voice data using a speaker recognition algorithm; generating a voice segment from voice data using a voice region detection algorithm; analyzing the similarity between the reference speaker model and the voice segment; correcting the start point (SP) and/or the end point (EP) of the voice segment based on the similarity; and performing speech recognition on the voice segment whose start point (SP) and/or end point (EP) has been corrected, or providing a speech recognition result.
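The sequence of operations listed above can be sketched as a single function. This is only an illustrative outline, not the implementation disclosed in the specification: every name here (`recognize_command`, `build_model`, `detect_segment`, `split`, `similarity`, `recognize`) is hypothetical, the concrete algorithms are passed in as callables because the text does not fix them, and the default threshold of 0.8 is an assumed value.

```python
def recognize_command(trigger, command, build_model, detect_segment,
                      split, similarity, recognize, threshold=0.8):
    """Hypothetical outline of the stored operations.

    trigger, command: sequences of audio samples.
    build_model(samples) -> speaker model        (speaker recognition algorithm)
    detect_segment(samples) -> (sp, ep) or None  (voice region detection algorithm)
    split(sp, ep) -> list of (start, end) sections in time order
    similarity(model_a, model_b) -> float
    recognize(samples) -> recognition result
    """
    # Generate the reference speaker model from the trigger voice data.
    reference = build_model(trigger)

    # Generate the first voice segment (start point SP, end point EP).
    first = detect_segment(command)
    if first is None:
        return None
    sp, ep = first

    # Keep only the sections whose speaker model is similar enough
    # to the reference speaker model.
    kept = [
        (s, e)
        for s, e in split(sp, ep)
        if similarity(reference, build_model(command[s:e])) > threshold
    ]
    if not kept:
        return None

    # Corrected start and end points of the voice segment.
    sp2, ep2 = kept[0][0], kept[-1][1]

    # Perform recognition on the corrected segment.
    return recognize(command[sp2:ep2])
```

Passing the algorithms in as parameters keeps the outline independent of any particular speaker recognition or voice region detection method.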

The memory unit 130 may store various information for executing a program according to an embodiment of the present invention. For example, the memory unit 130 may store the voice region detection algorithm and the speaker recognition algorithm. The memory unit 130 may also store at least one of a speaker model and a language model. The information that the memory unit 130 can store is not limited to the above. In addition, such information may be stored for each voice segment, and the scope of the present invention is not limited thereto.

Meanwhile, the memory unit 130 may store various data needed to provide the speech recognition apparatus 100 and may provide requested data in response to requests from other components.

The memory unit 130 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk. The text input program for touch input may also operate in association with web storage that performs the storage function of the memory unit 130 on the Internet.

To this end, the speech recognition apparatus 100 may include a communication unit. The communication unit may include a wired or wireless communication module for network access.

Wireless Internet technologies that may be used include wireless LAN (WLAN, Wi-Fi), wireless broadband (WiBro), Worldwide Interoperability for Microwave Access (WiMAX), and High Speed Downlink Packet Access (HSDPA).

Wired Internet technologies that may be used include digital subscriber line (xDSL), fiber to the home (FTTH), and power line communication (PLC).

In addition, the communication unit may include a short-range communication module and may transmit and receive data to and from an electronic device that includes a short-range communication module. Short-range communication technologies that may be used include Bluetooth, radio frequency identification (RFID), infrared communication (IrDA, Infrared Data Association), ultra-wideband (UWB), and ZigBee. The communication technologies described above are merely examples, and the present invention is not limited thereto.

Data transmitted and/or received through the communication unit may be stored in the memory unit 130 or may be transmitted to other nearby devices through the short-range communication module.

Meanwhile, the speech recognition apparatus 100 may further include components that are not shown. For example, the speech recognition apparatus 100 may further include a display unit. The display unit may display any information that can be displayed on the speech recognition apparatus 100. The display unit may include at least one of a liquid crystal display (LCD), a thin film transistor-liquid crystal display (TFT LCD), an organic light-emitting diode (OLED) display, a flexible display, and a three-dimensional (3D) display.

The various embodiments described in this specification may be implemented in a recording medium or storage medium readable by a computer or a similar device, using, for example, software, hardware, or a combination thereof.

For example, in a hardware implementation, the embodiments described in this specification may be implemented using at least one of application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and other electrical units for performing functions. In some cases, the embodiments described in this specification may be implemented by the controller itself.

As another example, in a software implementation, embodiments such as the procedures and functions described in this specification may be implemented as separate software modules. Each of the software modules may perform one or more of the functions and operations described in this specification. The software code may be implemented as a software application written in an appropriate programming language. The software code may be stored in the memory unit 130 and executed by the processor 120.

Meanwhile, the output unit 140 outputs the speech recognition result produced by the processor 120 to an output device. Specifically, the output unit 140 may output the speech recognition result through an image output device and/or an audio output device. For example, the image output device of the output unit 140 may be the display unit included in the speech recognition apparatus 100, and the speech recognition result may be output as an image through the image output device. Likewise, the audio output device of the output unit 140 may be a speaker included in the speech recognition apparatus 100, and the speech recognition result may be output as sound through the audio output device.

Hereinafter, for convenience of description, it is assumed that the speech recognition apparatus 100 is a home appliance, and that the speech recognition apparatus 100 is stored in the form of an application in a storage medium of the home appliance and driven by the processor 120 of the home appliance.

FIG. 1B is a block diagram schematically illustrating the configuration of a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention, as driven by a processor.

Referring to FIG. 1B, the speech recognition apparatus 100 using a reference speaker model according to an embodiment of the present invention includes a speaker model generator 121, a voice segment generator 122, a similarity analyzer 123, and a voice segment corrector 124.

The speaker model generator 121 generates a reference speaker model from trigger voice data using a speaker recognition algorithm. Specifically, the speaker model generator 121 extracts features from the trigger voice data received from the input unit 110 and generates a reference speaker model containing those features.
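As a rough illustration of "extract features and build a model containing them", the sketch below averages two simple per-frame features into a fixed-length vector. The specification does not say which speaker recognition algorithm or features are used, so the features (log energy and zero-crossing rate), the frame length of 160 samples, and the function names are all assumptions; a practical speaker model would use a far richer representation.

```python
import math


def frame_features(samples, frame_len=160):
    """Return a (log-energy, zero-crossing rate) pair for each full frame."""
    feats = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Fraction of adjacent sample pairs whose sign differs.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        feats.append((math.log(energy + 1e-10), crossings / (frame_len - 1)))
    return feats


def build_speaker_model(samples, frame_len=160):
    """Hypothetical speaker model: the mean feature vector over all frames."""
    feats = frame_features(samples, frame_len)
    if not feats:
        raise ValueError("audio is shorter than one frame")
    n = len(feats)
    return [sum(f[k] for f in feats) / n for k in range(2)]
```

The same function could be applied to the trigger voice data to obtain the reference speaker model and to each section of a command voice segment to obtain the command speaker models.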

Meanwhile, the trigger voice data is a command for activating the speech recognition apparatus 100. For example, when a user utters a given command to the speech recognition apparatus 100, the speech recognition apparatus 100 may be turned on. The trigger voice data may vary depending on the speech recognition apparatus 100.

The voice segment generator 122 analyzes command voice data using a voice region detection algorithm and generates a first command voice segment SM1 having a start point (SP) and an end point (EP).

Here, the command voice data is voice data input to the speech recognition apparatus 100 after it has been turned on by receiving the trigger voice data. Specifically, the command voice data contains the content for which a user requires speech recognition. For example, when a user inputs command voice data to the speech recognition apparatus 100, the speech recognition apparatus 100 may receive the command voice data and perform a speech recognition operation.

The first command voice segment SM1 is the section of the command voice data delimited by the start point (SP) and the end point (EP). Specifically, the command voice data may contain only what the user uttered, but depending on the surrounding environment it may also contain noise such as ambient sound or the speech of other users. The voice segment generator 122 excludes the noise from the command voice data and extracts the section that the user actually uttered to generate the first command voice segment SM1.
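A minimal sketch of how a start point and end point might be obtained is shown below. The specification does not name a particular voice region detection algorithm, so this uses a simple frame-energy threshold as a stand-in; the function name, the frame length, and the threshold value are assumptions for illustration.

```python
def detect_voice_segment(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices spanning all frames whose mean
    energy exceeds the threshold, or None if no frame does.

    This is a stand-in energy detector, not the algorithm of the source.
    """
    active = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            active.append(i)
    if not active:
        return None
    # Start point: first active frame; end point: end of last active frame.
    return active[0], active[-1] + frame_len
```

An energy detector of this kind cannot distinguish the intended speaker from loud noise or from another person talking, which is the situation the subsequent correction based on the reference speaker model is meant to address.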

Meanwhile, in addition to the reference speaker model, the speaker model generator 121 generates a plurality of command speaker models from the first command voice segment SM1. Specifically, the first command voice segment SM1 is divided into a plurality of sections, and the speaker model generator 121 extracts features from each of the sections to generate a command speaker model containing the features of that section. Here, the plurality of sections are regions outside or inside the first command voice segment SM1 relative to its start point (SP) and/or end point (EP).
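One way to picture the sections inside and outside the segment is the sketch below. The fixed section length, the number of outer sections, and the function name `split_sections` are assumptions; the specification states only that the sections lie inside or outside the segment relative to SP and/or EP.

```python
def split_sections(sp, ep, total_len, section_len, n_outer=1):
    """Return (inner, before, after) lists of (start, end) sample ranges.

    inner:  sections tiling the segment [sp, ep)
    before: up to n_outer sections just outside the start point
    after:  up to n_outer sections just outside the end point
    """
    inner = [(s, min(s + section_len, ep)) for s in range(sp, ep, section_len)]

    before = []
    for k in range(1, n_outer + 1):
        s = sp - k * section_len
        if s < 0:
            break  # ran off the beginning of the recording
        before.append((s, s + section_len))

    after = []
    for k in range(n_outer):
        s = ep + k * section_len
        if s + section_len > total_len:
            break  # ran off the end of the recording
        after.append((s, s + section_len))

    return inner, before, after
```

For example, with SP = 320, EP = 800, a 1120-sample recording, and 160-sample sections, the segment is tiled by three inner sections, with one outer section on each side.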

The similarity analyzer 123 analyzes the similarity between the reference speaker model generated from the trigger voice data and a command speaker model generated from the command voice data. Specifically, the similarity analyzer 123 compares the features of the reference speaker model with the features of the command speaker model to analyze the similarity between the two models.

The similarity analyzer 123 may then use the similarity to determine whether the reference speaker model and the command speaker model belong to the same speaker. For example, when the similarity exceeds a threshold, the similarity analyzer 123 may determine that the speaker of the reference speaker model and the speaker of the command speaker model are the same speaker. Conversely, when the similarity is equal to or below the threshold, the similarity analyzer 123 may determine that they are different speakers. The threshold may be a predetermined value and may be set in various ways.
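The threshold decision can be sketched as follows. The specification does not say how similarity is measured, so cosine similarity between feature vectors is an assumption here, as are the function names and the default threshold of 0.8.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_speaker(reference_model, command_model, threshold=0.8):
    """Same speaker only if the similarity strictly exceeds the threshold;
    a similarity equal to or below the threshold means a different speaker."""
    return cosine_similarity(reference_model, command_model) > threshold
```

The strict comparison mirrors the text: a similarity exactly at the threshold is treated as a different speaker.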

The voice segment corrector 124 generates a second command voice segment SM2 by correcting the start point (SP) and/or the end point (EP) of the first command voice segment SM1 according to the similarity between the reference speaker model and the command speaker model, and the same-speaker determination, obtained from the similarity analyzer 123. Specifically, when the similarity analyzer 123 determines that, in one of the plurality of sections of the first command voice segment SM1, the command speaker model and the reference speaker model do not match, the voice segment corrector 124 may correct the start point (SP) and/or the end point (EP) of the first command voice segment SM1 so that this section is excluded from the first command voice segment SM1. Accordingly, depending on the result of the similarity analyzer 123, the start point (SP') and/or the end point (EP') of the second command voice segment SM2 may be the same as or different from the start point (SP) and/or the end point (EP) of the first command voice segment SM1.
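The correction step can be illustrated with the sketch below, which trims sections at either end of the segment that were judged to belong to a different speaker. The function name and the representation of the analyzer's output as a list of booleans are assumptions for illustration.

```python
def correct_segment(sections, same_speaker):
    """Return the corrected (SP', EP') of the segment, or None.

    sections:     (start, end) sample ranges covering the first command
                  voice segment, in time order.
    same_speaker: one bool per section, True if that section's command
                  speaker model matched the reference speaker model.
    """
    matched = [sec for sec, ok in zip(sections, same_speaker) if ok]
    if not matched:
        return None  # no section matched the reference speaker
    # New start point: start of the first matching section.
    # New end point: end of the last matching section.
    return matched[0][0], matched[-1][1]
```

When every section matches, the corrected points equal the original ones, which corresponds to the case where SP' and EP' are the same as SP and EP.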

Meanwhile, the components of the speech recognition apparatus 100 are shown as separate components only for convenience of description. Depending on the implementation, they may be implemented as a single component, or one component may be split into two or more components.

Hereinafter, a speech recognition method using the speech recognition apparatus 100 will be described with reference to FIGS. 2 to 3G, based on the foregoing description of the speech recognition apparatus 100 using a reference speaker model according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating a speech recognition method using a reference speaker model according to an embodiment of the present invention. FIG. 3A is a schematic diagram illustrating a process of receiving first trigger voice data from an input unit in a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention. FIG. 3B is a schematic diagram illustrating a process of generating a reference speaker model from the first trigger voice data using a speaker recognition algorithm in a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention. FIG. 3C is a schematic diagram illustrating a process of receiving command voice data from the input unit in a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention. FIG. 3D is a schematic diagram illustrating a process of generating a first command voice segment from the command voice data using a voice region detection algorithm in a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention. FIG. 3E is a schematic diagram illustrating a process of dividing the first command voice segment into a plurality of sections in a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention. FIG. 3F is a schematic diagram illustrating a process of generating a second command voice segment by correcting the first command voice segment based on the reference speaker model in a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention. FIG. 3G is a schematic diagram illustrating a process of providing a speech recognition result for the second command voice segment in a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention.

First, the speech recognition apparatus 100 using a reference speaker model according to an embodiment of the present invention receives first trigger voice data TV1 from the first speaker 150 through the input unit 110 (S110).

Referring to FIG. 3A, the speech recognition apparatus 100 receives the first trigger voice data TV1 from the first speaker 150. Although FIG. 3A shows the first trigger voice data TV1 as "Hello", the first trigger voice data TV1 may be varied and is not limited thereto.

Subsequently, a reference speaker model is generated from the first trigger voice data TV1 using a speaker recognition algorithm (S120).

Referring to FIG. 3B, the speech recognition apparatus 100 uses a speaker recognition algorithm to extract features unique to the first speaker 150 from the first trigger voice data TV1 received from the first speaker 150. For example, the speech recognition apparatus 100 identifies the pronunciation, intonation, stress, and the like of the first speaker 150 from the first trigger voice data TV1 and generates a reference speaker model that includes these features of the first speaker 150.

Here, the reference speaker model is the speaker model that later serves as the reference for correcting the start point SP and the end point EP of the first command voice segment SM1 generated from the command voice data OV.

Meanwhile, the speech recognition apparatus 100 may compare the reference speaker model generated from the first trigger voice data TV1 with the plurality of speaker models MO1, MO2, ..., MOn stored in the memory unit 130, and may either update a stored speaker model or newly store the reference speaker model in the memory unit 130. Details of updating and storing speaker models are described later with reference to FIGS. 4 to 6B.

Subsequently, command voice data OV is received from the first speaker 150 (S130).

Referring to FIG. 3C, the speech recognition apparatus 100 receives the first trigger voice data TV1 uttered by the first speaker 150 and then receives the command voice data OV. Although FIG. 3C shows the command voice data OV as "Please tell me tomorrow's weather", the command voice data OV may be varied and is not limited thereto.

Subsequently, a first command voice segment SM1 having a start point SP and an end point EP is generated from the command voice data OV using a voice region detection algorithm (S140).

Referring to FIG. 3D, the voice segment generator 122 of the speech recognition apparatus 100 uses the voice region detection algorithm to detect, in the command voice data OV, the section expected to be the actual utterance section of the first speaker 150. The voice segment generator 122 then generates the first command voice segment SM1, which includes information on the start point SP and the end point EP of that section.

When the speech recognition apparatus 100 receives the command voice data OV from the first speaker 150, it may also receive noise and the like in addition to the voice of the first speaker 150. The voice segment generator 122 therefore detects only the section of the noisy command voice data OV that is expected to be the voice of the first speaker 150, and generates the first command voice segment SM1 from it.

The remaining sections of the command voice data OV that are not included in the first command voice segment SM1 may be sections unrelated to the voice of the first speaker 150.
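As a rough illustration of step S140, the sketch below finds a start point and an end point with a simple frame-energy threshold. The description does not say which voice region detection algorithm is used, so the energy criterion, the function name `detect_voice_segment`, and the parameters `frame_len` and `energy_threshold` are all assumptions made for this example.

```python
def detect_voice_segment(samples, frame_len=4, energy_threshold=0.1):
    """Return (start, end) sample indices spanning the first to last active frame.

    A frame is "active" when its mean squared amplitude exceeds the threshold
    (an assumed, simplistic criterion). Returns None when no frame is active.
    """
    active_starts = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > energy_threshold:
            active_starts.append(i)
    if not active_starts:
        return None
    # SP is the start of the first active frame, EP the end of the last one.
    return active_starts[0], active_starts[-1] + frame_len
```

A detector like this cannot tell the first speaker's voice from loud noise or another talker, which is the gap the later correction step (S150) is meant to address.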

Subsequently, the second command voice segment SM2 is generated by correcting the start point SP and/or the end point EP of the first command voice segment SM1 based on the reference speaker model (S150).

Referring to FIG. 3E, the speaker model generator 121 generates a plurality of command speaker models from the first command voice segment SM1.

First, the region inside or outside the first command voice segment SM1 may be divided into a plurality of sections S1, S2, S3, and S4 with respect to the start point SP and/or the end point EP of the first command voice segment SM1.

For example, the plurality of sections S1, S2, S3, and S4 may be a first section S1 extending from the start point SP of the first command voice segment SM1 into the region outside the first command voice segment SM1, a second section S2 extending from the start point SP into the region inside the first command voice segment SM1, a third section S3 extending from the end point EP of the first command voice segment SM1 into the region inside the first command voice segment SM1, and a fourth section S4 extending from the end point EP into the region outside the first command voice segment SM1.

The speaker model generator 121 then extracts features of the command voice data OV in each of the first section S1 to the fourth section S4, generating a first command speaker model for the first section S1, a second command speaker model for the second section S2, a third command speaker model for the third section S3, and a fourth command speaker model for the fourth section S4.

Meanwhile, although FIG. 3E shows the first section S1 to the fourth section S4 as having the same length, their lengths may be varied and are not limited thereto.

In addition, although FIG. 3E shows the first command voice segment SM1 as divided into the first section S1 to the fourth section S4, the plurality of sections S1, S2, S3, and S4 may be divided in various other ways with respect to the start point SP or the end point EP of the first command voice segment SM1, and are not limited thereto.
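The four-section layout around SP and EP can be sketched as below. This assumes equal-width sections measured in sample indices, as drawn in FIG. 3E, and clamps them to the available data. The function name `split_sections` and the parameters `width` and `total_len` are hypothetical.

```python
def split_sections(sp, ep, width, total_len):
    """Return the four sections as (start, end) index pairs.

    S1: outside the segment, just before SP
    S2: inside the segment, just after SP
    S3: inside the segment, just before EP
    S4: outside the segment, just after EP
    Sections are clamped so they never leave [0, total_len], and the
    inner sections never cross the opposite boundary.
    """
    return {
        "S1": (max(sp - width, 0), sp),
        "S2": (sp, min(sp + width, ep)),
        "S3": (max(ep - width, sp), ep),
        "S4": (ep, min(ep + width, total_len)),
    }
```

For a segment with SP = 10 and EP = 30 in 40 samples of data, a width of 5 gives S1 = (5, 10), S2 = (10, 15), S3 = (25, 30), and S4 = (30, 35). Equal widths are only one option, since the text notes that the section lengths may vary.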

Subsequently, once the speaker model generator 121 has generated the first to fourth command speaker models for the first section S1 to the fourth section S4, the similarity analyzer 123 analyzes the similarity between each of the first to fourth command speaker models and the reference speaker model. Based on the similarity, the similarity analyzer 123 then determines whether each of the first to fourth command speaker models is identical to the reference speaker model.

For example, the similarity analyzer 123 may analyze the similarity between the first command speaker model of the first section S1 and the reference speaker model. If the analysis shows that this similarity exceeds the threshold, the similarity analyzer 123 may finally determine that the first command speaker model and the reference speaker model have the same speaker.

Likewise, the similarity analyzer 123 may analyze the similarity between the second command speaker model of the second section S2 and the reference speaker model and finally determine that the second command speaker model and the reference speaker model have the same speaker.

The similarity analyzer 123 may also analyze the similarity between the third command speaker model of the third section S3 and the reference speaker model and finally determine that the third command speaker model and the reference speaker model have the same speaker.

Meanwhile, the similarity analyzer 123 may analyze the similarity between the fourth command speaker model of the fourth section S4 and the reference speaker model. Because the analysis shows that this similarity is less than or equal to the threshold, the similarity analyzer 123 may determine that the fourth command speaker model and the reference speaker model do not have the same speaker.

Next, referring to FIG. 3F, the voice segment corrector 124 corrects the start point SP and/or the end point EP of the first command voice segment SM1 according to the result of the similarity analyzer 123.

Specifically, the similarity analyzer 123 has determined that the first command speaker model of the first section S1, the second command speaker model of the second section S2, and the third command speaker model of the third section S3 are identical to the reference speaker model, and that the fourth command speaker model of the fourth section S4 is not identical to the reference speaker model.

Accordingly, the voice segment corrector 124 includes in the first command voice segment SM1 the first section S1 to the third section S3, whose speaker models were determined to be identical to the reference speaker model, and excludes from the first command voice segment SM1 the fourth section S4, whose speaker model was determined not to be identical to the reference speaker model.

That is, when a command speaker model is identical to the reference speaker model, the voice segment corrector 124 corrects the start point SP and/or the end point EP of the first command voice segment SM1 so that the section corresponding to that command speaker model is included in the first command voice segment SM1.

Likewise, when a command speaker model is not identical to the reference speaker model, the voice segment corrector 124 corrects the start point SP and/or the end point EP of the first command voice segment SM1 so that the section corresponding to that command speaker model is not included in the first command voice segment SM1.

As a result, the voice segment corrector 124 corrects the start point SP and the end point EP of the first command voice segment SM1, and the second command voice segment SM2 is generated.
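One possible reading of the include/exclude rule above is sketched below. The description states only that matching sections are included and non-matching sections are excluded. The specific policy here (extend a boundary outward when the outer section matches, otherwise pull it inward when the inner section does not match) is an assumption, as are the function name `correct_segment` and the dictionary layout.

```python
def correct_segment(sp, ep, sections, same):
    """Return corrected (sp', ep') under one hypothetical boundary rule.

    sections: dict with keys "S1".."S4" mapping to (start, end) indices.
    same:     dict with the same keys, True when that section's command
              speaker model was judged identical to the reference model.
    """
    new_sp, new_ep = sp, ep

    # Start side: S1 lies outside SP, S2 lies inside SP.
    if same["S1"]:
        new_sp = sections["S1"][0]   # extend outward to cover S1
    elif not same["S2"]:
        new_sp = sections["S2"][1]   # shrink inward to drop S2

    # End side: S3 lies inside EP, S4 lies outside EP.
    if same["S4"]:
        new_ep = sections["S4"][1]   # extend outward to cover S4
    elif not same["S3"]:
        new_ep = sections["S3"][0]   # shrink inward to drop S3

    return new_sp, new_ep
```

With the FIG. 3F outcome (S1 to S3 match, S4 does not) and sections S1 = (5, 10), S2 = (10, 15), S3 = (25, 30), S4 = (30, 35), this rule moves the start point from 10 to 5 and leaves the end point at 30. When every inner section matches and every outer section fails, SP' and EP' equal SP and EP, consistent with the statement that the corrected points may be unchanged.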

Finally, speech recognition is performed on the second command voice segment SM2, or a speech recognition result is provided (S160).

Referring to FIG. 3G, the speech recognition apparatus 100 performs speech recognition on the second command voice segment SM2, whose start point SP' and end point EP' have been corrected based on the reference speaker model.

The second command voice segment SM2 contains the command voice data OV for the first speaker 150's utterance "Please tell me tomorrow's weather", and the speech recognition apparatus 100 analyzes the second command voice segment SM2.

The speech recognition apparatus 100 analyzes the second command voice segment SM2 and provides the first speaker 150 with information about tomorrow's weather. For example, the speech recognition apparatus 100 may output the speech recognition result as speech, such as "Tomorrow's weather is cloudy". However, the speech recognition apparatus 100 may also output the speech recognition result as an image through a display unit or the like rather than only as speech, and is not limited thereto.

The speech recognition method and the speech recognition apparatus 100 using a reference speaker model according to an embodiment of the present invention correct the start point SP and the end point EP of the first command voice segment SM1 using the first trigger voice data TV1. Specifically, the input unit 110 of the speech recognition apparatus 100 receives the first trigger voice data TV1 from the first speaker 150, and the speaker model generator 121 generates the reference speaker model from the first trigger voice data TV1. Next, the input unit 110 receives the command voice data OV from the first speaker 150, and the voice segment generator 122 generates the first command voice segment SM1 by setting the start point SP and the end point EP of the section of the command voice data OV expected to be the actual utterance section of the first speaker 150. The similarity analyzer 123 then compares the command speaker models of the plurality of sections of the first command voice segment SM1 with the reference speaker model to determine their similarity and whether they are identical. Finally, according to the result of the similarity analyzer 123, the voice segment corrector 124 ensures that sections corresponding to command speaker models determined to be identical to the reference speaker model are included in the first command voice segment SM1, and that sections corresponding to command speaker models determined not to be identical to the reference speaker model are not included in the first command voice segment SM1. The voice segment corrector 124 thus corrects the start point SP and/or the end point EP of the first command voice segment SM1, and the second command voice segment SM2 is generated.

Here, the command voice data OV may include not only the voice of the first speaker 150 but also ambient noise, the voices of other speakers, and the like. The first command voice segment SM1 therefore may or may not include the section in which the first speaker 150 actually spoke. To improve the accuracy of detecting the start point SP and the end point EP of the command voice segment, the speech recognition method and the speech recognition apparatus 100 using a reference speaker model according to an embodiment of the present invention may correct the first command voice segment SM1 using the first trigger voice data TV1, which is the command for turning on the speech recognition apparatus 100. The speech recognition apparatus 100 generates a reference speaker model by extracting the features of the first speaker 150 from the first trigger voice data TV1, and analyzes the similarity by comparing each section of the first command voice segment SM1 with the reference speaker model. The start point SP and/or the end point EP are then corrected so that only sections corresponding to command speaker models with features similar to the reference speaker model are included in the first command voice segment SM1. Accordingly, by using the trigger voice data as a reference, the speech recognition method and the speech recognition apparatus 100 according to an embodiment of the present invention can accurately detect the start point and the end point of the command voice segment, and speech recognition performance can be improved.

FIG. 4 is a block diagram schematically illustrating the memory unit of a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention.

Referring to FIG. 4, the memory unit stores a plurality of speaker models, as described above with reference to FIG. 1A.

Each of the plurality of speaker models MO1, MO2, ..., MOn includes the features of one speaker. For example, the first speaker model MO1 is a speaker model that stores the features of the first speaker 150 obtained by analyzing voice data received from the first speaker 150, and the second speaker model MO2 is a speaker model that stores the features of a second speaker obtained by analyzing voice data received from the second speaker. Each of the plurality of speaker models MO1, MO2, ..., MOn therefore stores the features of a different speaker.

Meanwhile, the speech recognition apparatus 100 may compare the reference speaker model generated from a particular speaker's trigger voice data with the plurality of speaker models MO1, MO2, ..., MOn stored in the memory unit 130, and may either update a stored speaker model or newly store the reference speaker model in the memory unit 130.

FIGS. 5A and 5B are exemplary diagrams illustrating a process of storing a second reference speaker model in the memory unit of a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention.

Referring to FIG. 5A, the speech recognition apparatus 100 receives second trigger voice data from a second speaker and analyzes the second trigger voice data to generate a second reference speaker model TMO2.

The speech recognition apparatus 100 then compares the second reference speaker model TMO2 with each of the plurality of speaker models MO1, MO2, ..., MOn stored in the memory unit 130 to analyze their similarity and determine whether they are identical.

When one of the plurality of speaker models MO1, MO2, ..., MOn is determined to be identical to the second reference speaker model TMO2, the speech recognition apparatus 100 may update that speaker model so that it incorporates the second reference speaker model TMO2.

For example, referring to FIG. 5B, when the second reference speaker model TMO2 is determined to be identical to the first speaker model MO1, the speech recognition apparatus 100 may update the first speaker model MO1 so that it includes the information of the second reference speaker model TMO2. The updated first speaker model MO1' thus stores the information of the second reference speaker model TMO2 together with the existing information about the first speaker 150.

Conversely, when the second reference speaker model TMO2 is not identical to any of the plurality of speaker models MO1, MO2, ..., MOn stored in the memory unit 130, the second reference speaker model TMO2 may be newly stored. This is described in detail with reference to FIGS. 6A and 6B.

FIGS. 6A and 6B are exemplary diagrams illustrating a process of storing a second reference speaker model in the memory unit of a speech recognition apparatus using a reference speaker model according to an embodiment of the present invention.

Referring to FIG. 6A, the speech recognition apparatus 100 receives second trigger voice data from a second speaker and analyzes the second trigger voice data to generate a second reference speaker model TMO2.

The speech recognition apparatus 100 then compares the second reference speaker model TMO2 with each of the plurality of speaker models MO1, MO2, ..., MOn stored in the memory unit 130 to analyze their similarity and determine whether they are identical.

In this case, when the memory unit 130 contains no speaker model that has a high similarity to the second reference speaker model TMO2 and is determined to be identical to it, the second reference speaker model TMO2 is stored in the memory unit 130 as a new speaker model.

For example, referring to FIG. 6B, the speech recognition apparatus 100 stores the second reference speaker model TMO2 in the memory unit 130 as an (n+1)-th speaker model MOn+1. The memory unit 130 thus contains the previously stored plurality of speaker models MO1, MO2, ..., MOn and the (n+1)-th speaker model MOn+1 newly stored from the second reference speaker model TMO2.

Accordingly, the speech recognition method and the speech recognition apparatus 100 using a reference speaker model according to an embodiment of the present invention store the reference speaker model generated from trigger voice data in the memory unit 130. Specifically, the memory unit 130 stores a plurality of speaker models MO1, MO2, ..., MOn. The speech recognition apparatus 100 analyzes the similarity between the reference speaker model and each of the plurality of speaker models MO1, MO2, ..., MOn stored in the memory unit 130 and determines whether they are identical. When a speaker model determined to be identical to the reference speaker model is found, that stored speaker model is updated by adding the information of the reference speaker model to it. When no speaker model determined to be identical to the reference speaker model is found, the reference speaker model is newly stored in the memory unit 130.
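The update-or-store behavior summarized above can be sketched as follows. The representation is an assumption: each stored model is a dictionary holding a mean feature vector and a count, "adding the information of the reference speaker model" is approximated by a running mean, and similarity is cosine similarity. The description fixes none of these, and the names `update_store` and `_cosine` are hypothetical.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def update_store(store, reference, threshold=0.7):
    """Merge `reference` into the best-matching stored model, or append it.

    store: list of {"mean": [...], "count": int} dicts (MO1..MOn).
    Returns the index of the model that was updated or newly created.
    """
    best_idx, best_sim = None, threshold
    for idx, model in enumerate(store):
        sim = _cosine(model["mean"], reference)
        if sim > best_sim:               # must strictly exceed the threshold
            best_idx, best_sim = idx, sim

    if best_idx is None:
        # No identical speaker found: store as a new model (MOn+1).
        store.append({"mean": list(reference), "count": 1})
        return len(store) - 1

    # Identical speaker found: fold the new vector into the running mean.
    model = store[best_idx]
    n = model["count"]
    model["mean"] = [(m * n + r) / (n + 1) for m, r in zip(model["mean"], reference)]
    model["count"] = n + 1
    return best_idx
```

Calling `update_store` with a vector close to an existing model updates that model in place, while a dissimilar vector grows the store by one entry, mirroring the cases of FIGS. 5B and 6B respectively.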

Thus, the speech recognition method and the speech recognition apparatus 100 using a reference speaker model according to an embodiment of the present invention store information about the reference speaker model in the memory unit 130. A speaker model newly stored in the memory unit 130 can be updated when speech recognition is later performed for the speaker associated with that model, which can improve the speech recognition capability of the speech recognition apparatus 100. In addition, a speaker model that is updated continuously can contain more detailed information and features of a specific speaker, so the speech recognition capability of the speech recognition apparatus 100 can improve further.

In this specification, each block or step may represent a module, a segment, or a portion of code that contains one or more executable instructions for executing the specified logical function(s). It should also be noted that, in some alternative embodiments, the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially concurrently, or they may sometimes be performed in the reverse order, depending on the functionality involved.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the present invention is not necessarily limited to these embodiments, and various modifications can be made without departing from the technical idea of the present invention. The embodiments disclosed herein are therefore intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The embodiments described above should be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be interpreted according to the following claims, and all technical ideas within a scope equivalent to them should be interpreted as falling within the scope of the present invention.

100: speech recognition device
110: input unit
120: processor
121: speaker model generator
122: voice segment generator
123: similarity analysis unit
124: voice segment correction unit
130: memory unit
140: output unit
150: first speaker
TV1: first trigger voice data
OV: command voice data
SM1: first command voice segment
SM2: second command voice segment
SP, SP': start point
EP, EP': end point
S1: first section
S2: second section
S3: third section
S4: fourth section
MO1, MO1': first speaker model
MO2: second speaker model
MOn: n-th speaker model
MOn+1: (n+1)-th speaker model
TMO2: second reference speaker model

Claims

A speech recognition method using a reference speaker model, the method comprising:
receiving first trigger voice data from a first speaker;
generating a reference speaker model from the first trigger voice data using a speaker recognition algorithm;
receiving command voice data from the first speaker;
generating a first command voice segment having a start point and an end point from the command voice data using a voice region detection algorithm;
generating a second command voice segment by correcting the start point and/or the end point of the first command voice segment based on the reference speaker model; and
performing voice recognition on the second command voice segment or providing a voice recognition result for the second command voice segment.

The method of claim 1,
wherein generating the second command voice segment comprises:
dividing the first command voice segment into a plurality of sections;
generating a command speaker model for each of the plurality of sections using the speaker recognition algorithm; and
determining whether the command speaker model and the reference speaker model are identical.

The method of claim 2,
wherein, when the command speaker model and the reference speaker model are identical,
generating the second command voice segment comprises correcting the start point and/or the end point of the first command voice segment such that the section corresponding to the command speaker model is included in the first command voice segment.

The method of claim 2,
wherein, when the command speaker model and the reference speaker model are not identical,
generating the second command voice segment comprises correcting the start point and/or the end point of the first command voice segment such that the section corresponding to the command speaker model is excluded from the first command voice segment.

The method of claim 2,
wherein the plurality of sections comprises at least one of:
a first section between the start point of the first command voice segment and an outer region of the first command voice segment;
a second section between the start point of the first command voice segment and an inner region of the first command voice segment;
a third section between the end point of the first command voice segment and an inner region of the first command voice segment; and
a fourth section between the end point of the first command voice segment and an outer region of the first command voice segment.
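
The section-wise correction recited above can be illustrated with a minimal sketch. It is not the claimed implementation: the fixed section length `margin`, the single-pass logic, and the callable `is_same_speaker` (a hypothetical stand-in for generating a command speaker model on a section and comparing it with the reference speaker model) are all assumptions made for illustration. Times are integers in milliseconds.

```python
def correct_segment(start, end, margin, is_same_speaker):
    """Return a corrected (start, end) pair for a command voice segment.

    start, end      -- SP and EP of the first command voice segment (ms)
    margin          -- assumed length of each section S1..S4 (ms)
    is_same_speaker -- hypothetical callable (t0, t1) -> bool that reports
                       whether the audio in [t0, t1) matches the reference
                       speaker model
    """
    if end - start < 2 * margin:
        raise ValueError("segment must be at least two sections long")

    new_start, new_end = start, end

    # S1: outer section just before the start point. If the reference
    # speaker is already talking there, extend the segment to include it.
    s1_start = max(0, start - margin)
    if s1_start < start and is_same_speaker(s1_start, start):
        new_start = s1_start
    # S2: inner section just after the start point. If it does not belong
    # to the reference speaker, trim it from the segment.
    elif not is_same_speaker(start, start + margin):
        new_start = start + margin

    # S4: outer section just after the end point (include when it matches).
    if is_same_speaker(end, end + margin):
        new_end = end + margin
    # S3: inner section just before the end point (exclude when it differs).
    elif not is_same_speaker(end - margin, end):
        new_end = end - margin

    return new_start, new_end


# Toy oracle: the reference speaker talks from 1000 ms to 3000 ms.
def toy_same_speaker(t0, t1):
    return t0 >= 1000 and t1 <= 3000


print(correct_segment(800, 3400, 200, toy_same_speaker))   # segment too wide
print(correct_segment(1200, 2800, 200, toy_same_speaker))  # segment too narrow
```

The sketch moves each boundary by at most one section per call; a real system might repeat the check until the boundary stops moving, which the source does not specify.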

The method of claim 1, further comprising:
storing the reference speaker model in a memory unit;
receiving second trigger voice data from a second speaker; and
generating a second reference speaker model from the second trigger voice data using the speaker recognition algorithm.

The method of claim 6, further comprising:
when the second reference speaker model and the reference speaker model are identical,
updating the reference speaker model based on the second reference speaker model.

The method of claim 6, further comprising:
when the second reference speaker model and the reference speaker model are not identical,
storing the second reference speaker model in the memory unit.

A speech recognition device using a reference speaker model, the device comprising:
an input unit configured to receive trigger voice data and command voice data;
a speaker model generator configured to generate a reference speaker model by analyzing the trigger voice data;
a voice segment generator configured to analyze the command voice data to generate a first command voice segment having a start point and an end point; and
a voice segment corrector configured to generate a second command voice segment by correcting the start point and/or the end point of the first command voice segment based on the reference speaker model.

The speech recognition device of claim 9,
wherein the first command voice segment includes a plurality of sections, and
the speaker model generator generates a command speaker model for each of the plurality of sections.

The speech recognition device of claim 10,
further comprising a similarity analysis unit configured to analyze the similarity between the reference speaker model and the command speaker model,
wherein, when the similarity between the reference speaker model and the command speaker model exceeds a threshold,
the voice segment corrector corrects the start point and/or the end point of the first command voice segment such that the section corresponding to the command speaker model is included in the first command voice segment.

The speech recognition device of claim 11,
wherein, when the similarity between the reference speaker model and the command speaker model is less than or equal to the threshold,
the voice segment corrector corrects the start point and/or the end point of the first command voice segment such that the section corresponding to the command speaker model is excluded from the first command voice segment.

The speech recognition device of claim 10,
wherein the plurality of sections comprises
an outer region or an inner region of the first command voice segment relative to the start point or the end point of the first command voice segment.

The speech recognition device of claim 9,
further comprising a memory unit configured to store a plurality of speaker models,
wherein the speaker model generator analyzes the trigger voice data and, when the trigger voice data corresponds to one speaker model of the plurality of speaker models stored in the memory unit, updates the one speaker model based on the analyzed trigger voice data and generates the updated one speaker model as the reference speaker model.