KR20200036820A

KR20200036820A - Apparatus and Method for Sound Source Separation based on Radar

Info

Publication number: KR20200036820A
Application number: KR1020200024428A
Authority: KR
Inventors: 김용재
Original assignee: (주)스마트레이더시스템
Priority date: 2018-09-28
Filing date: 2020-02-27
Publication date: 2020-04-07
Also published as: KR102407872B1

Abstract

The present invention relates to a radar-based apparatus for a voice recognition service. The apparatus, which can accurately recognize a user command including a wake-up word, comprises: a radar; a sound source location tracking unit that analyzes the transmission/reception signals of the radar to locate a speaker; a sound acquisition unit that receives ambient sound; a sound source separation unit that separates the speaker's voice from the sound input through the sound acquisition unit, based on the speaker location information tracked by the sound source location tracking unit; and a postprocessing unit that recognizes the speaker's voice separated by the sound source separation unit.

Description

Apparatus and Method for Sound Source Separation based on Radar

The present invention relates to speech recognition technology, and more particularly to a speech recognition service device that uses radar technology.

Speech recognition means having a machine understand everyday human speech and carry out tasks according to what it has understood. With advances in computing and telecommunications, speech recognition technology lets people obtain information remotely without having to move, and it is driving the development of devices built around voice-operated systems.

With recent advances in speech recognition technology, a variety of speech recognition services have been released, such as Apple's Siri, Google's Now, Microsoft's Cortana, and Amazon's Alexa. To accurately recognize user commands in such a service, the location of a target sound source, such as the speaker's voice, must be tracked in real time, and in a noisy environment each sound source must be separated from the signal received by the microphone so that its location can be determined.

However, existing speech recognition services perform source localization by analyzing the sound signal only after the sound has occurred, so localization introduces a time delay. For example, the speech recognition service starts when the wake-up word 'Alexa' is spoken, but the speech recognition service device separates the speaker and performs speech recognition only after it has localized the source of the sound 'Alexa'. Because of the localization delay, 'Al' is not recognized and only 'lexa' is, so the device fails to detect the wake-up. The user then has to say 'Alexa' again, which is inconvenient. In addition, when the user moves, speech recognition is performed only after sound source tracking has caught up, so the user command may not be recognized accurately.

The present invention provides a radar-based speech recognition service device and method capable of accurately recognizing a user command that includes a wake-up word.

The invention also provides a radar-based speech recognition service device and method capable of preventing the speech recognition delay caused by source localization.

The invention further provides a radar-based speech recognition service device and method capable of recognizing the sound source accurately even when the user moves.

The present invention is a radar-based speech recognition service device comprising: a radar; a sound source location tracking unit that analyzes the radar's transmit/receive signals to locate a speaker; a sound acquisition unit that receives ambient sound; a sound source separation unit that separates the speaker's voice from the sound input through the sound acquisition unit, based on the speaker location information tracked by the sound source location tracking unit; and a post-processing unit that recognizes the speaker's voice separated by the sound source separation unit.

The present invention is also a radar-based speech recognition service method comprising: tracking a speaker's location by analyzing the radar's transmit/receive signals; separating the speaker's voice from the input sound based on the tracked speaker location information; and recognizing the separated speaker's voice.

According to the present invention, the sound source location is found quickly and in advance using radar, so no localization delay occurs and a user command that includes a wake-up word can be recognized accurately. In addition, because the sound source location is found quickly even when the user moves, accurate speech recognition can be performed.

Sound source location tracking is used not only in household appliances but also in devices such as service robots that assist with housework, surveillance cameras that track sound sources to monitor intruders, and video cameras used in multi-party video conferencing.

FIG. 1 is a schematic block diagram of a radar-based speech recognition service device according to an embodiment of the present invention.
FIG. 2 is an example of a top view of a radar-based speech recognition service device according to the present invention.
FIG. 3 is a flowchart illustrating a radar-based speech recognition service method according to an embodiment of the present invention.

Hereinafter, a radar-based speech recognition service device and method according to a preferred embodiment are described in detail with reference to the accompanying drawings. The same reference numerals are used for the same components, and repeated descriptions, as well as detailed descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the invention, are omitted. The embodiments are provided to describe the invention more completely to those of ordinary skill in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

Combinations of the blocks of the accompanying block diagram and the steps of the flowchart may be carried out by computer program instructions (an execution engine). Because these instructions may be loaded onto the processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed by that processor create means for performing the functions described in each block of the block diagram or each step of the flowchart.

These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner. The instructions stored in that memory can therefore produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagram or each step of the flowchart.

The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on that equipment to produce a computer-executed process. The instructions that run on the computer or other programmable data processing equipment can thus provide steps for carrying out the functions described in each block of the block diagram and each step of the flowchart.

In addition, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for performing the specified logical functions. It should be noted that in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially simultaneously, or they may be performed in reverse order where the functions involved call for it.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments illustrated below may be modified in various other forms, and the scope of the invention is not limited to them. The embodiments are provided to describe the invention more completely to those of ordinary skill in the art.

FIG. 1 is a schematic block diagram of a radar-based speech recognition service device according to an embodiment of the present invention, and FIG. 2 is an example of a top view of a radar-based speech recognition service device according to the present invention.

Referring to FIG. 1, the radar-based speech recognition service device (hereinafter the 'device') 100 includes a sound acquisition unit 110, a sound source separation unit 120, a post-processing unit 130, a radar 141, a sound source location tracking unit 142, and a control unit 150.

The radar-based speech recognition service device 100 is a terminal installed in a space where the user lives, such as a living room. It receives sound containing the speaker's service request through the microphone, the sound source separation unit 120 analyzes that sound to identify the speaker, and the post-processing unit 130 performs the service corresponding to the meaning of the identified speaker's voice. For example, when the voice input is "Turn on the living room light", the corresponding service may be switching the living room lighting on or off. This is only an example, and the present invention is not limited to it. The device 100 may be a household appliance to which sound source location tracking is applied, or a device such as a service robot that assists with housework, a surveillance camera that tracks sound sources to monitor intruders, or a video camera used in multi-party video conferencing.

According to an embodiment, the sound acquisition unit 110 may include a microphone array 111 in which a plurality of microphones 111a, 111b, ... are arranged radially. Referring to FIG. 2, the microphone array 111 may consist of the microphones 111a, 111b, ... arranged radially around the circumference of the cylindrical housing of the device 100. This is only an example, and the present invention is not limited to it. The device 100 may take various forms, and the microphone array 111 may be a separate component that communicates with the device 100 over a wired or wireless link. According to another embodiment, the microphones 111a, 111b, ... may be attached to the housing of the device 100 by rotatable supports (not shown), so that the supports can be reoriented under the control of the control unit 150.

Various sound sources (SS 1, ..., SS i, ..., SS N) may be present around the microphone array 111. The sound sources may include one or more users as well as electronic devices that generate noise, such as a nearby TV. Each of the microphones 111a, 111b, ... of the microphone array 111 receives and outputs sound generated by the various sources, dominated by the source located in the direction the microphone faces. The microphone array 111 is connected to an amplification unit 112, which amplifies the sound input through the array. According to an embodiment of the present invention, the amplification unit 112 consists of a plurality of amplifiers, each amplifying the sound output from one of the microphones 111a, 111b, ..., and the gain of each amplifier can be controlled independently.

Referring again to FIG. 1, when the device 100 is powered on, the radar 141 transmits radar signals into the surroundings and receives the signals reflected back from surrounding objects. According to an embodiment of the present invention, the radar 141, like the microphone array 111, may consist of a plurality of radars (not shown) arranged radially around the housing of the device 100. The radar 141 can thus transmit a plurality of radar signals radially and receive their reflections. According to another embodiment, the radars may be attached to the housing of the device 100 by rotatable supports (not shown), so that the supports can be reoriented under the control of the control unit 150.

The sound source location tracking unit 142 analyzes the transmit/receive signals of the radar 141 to locate the speaker. That is, it calculates the distance to each reflector from the transmit/receive signals and tracks the sound source location according to the calculated distances. The sound source location tracking unit 142 may store in advance the distances to the static reflectors present in the surroundings when no person is present. A static reflector is a reflector that does not move, for example a TV in the living room, a wall, or a sofa.
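The distance calculation can be illustrated with a minimal sketch. The description does not say how the distance is derived from the transmit/receive signals; the version below assumes a simple time-of-flight measurement, and the function name `reflector_distance_m` is hypothetical.

```python
# Assumption: range is obtained from the round-trip delay between transmit
# and receive. The patent text only says the distance is "calculated from
# the transmit/receive signals"; other radar schemes are equally possible.
SPEED_OF_LIGHT_M_S = 299_792_458.0


def reflector_distance_m(round_trip_delay_s: float) -> float:
    """Return the one-way distance to a reflector for a round-trip delay."""
    if round_trip_delay_s < 0:
        raise ValueError("round-trip delay must be non-negative")
    # The signal travels to the reflector and back, hence the division by 2.
    return SPEED_OF_LIGHT_M_S * round_trip_delay_s / 2.0
```

Under this assumption, a 20 ns round trip corresponds to a reflector roughly 3 m away.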

Accordingly, when the sound source location tracking unit 142 calculates a distance for a dynamic reflector that differs from the stored distances to the surrounding static reflectors, it can determine that dynamic reflector to be a speaker. Even if a TV is switched on and producing sound, the TV is one of the stored static reflectors, so the sound source location tracking unit 142 need not treat it as a speaker. The sound source location tracking unit 142 may also determine that there are two or more speakers. It can also detect outline information about the speaker's shape, for example height and posture (sitting or standing), and can distinguish a person from a TV.
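The comparison against the stored static reflectors might look like the following sketch. The description only mentions stored distance information; the azimuth field, the tolerance values, and the names `Reflection` and `find_dynamic_reflectors` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reflection:
    azimuth_deg: float  # assumed: direction of the radar beam that saw it
    distance_m: float   # distance to the reflector


def find_dynamic_reflectors(static_map, current,
                            distance_tol_m=0.2, azimuth_tol_deg=5.0):
    """Return reflections in `current` that match no stored static reflector."""
    def matches(a, b):
        # Wrap the azimuth difference into [-180, 180) before comparing.
        d_az = abs((a.azimuth_deg - b.azimuth_deg + 180.0) % 360.0 - 180.0)
        d_r = abs(a.distance_m - b.distance_m)
        return d_az <= azimuth_tol_deg and d_r <= distance_tol_m

    return [r for r in current if not any(matches(r, s) for s in static_map)]


# Hypothetical room: a wall straight ahead and a TV to the side.
static_map = [Reflection(0.0, 3.0), Reflection(90.0, 2.5)]
scan = [Reflection(0.0, 3.05), Reflection(90.0, 2.5), Reflection(45.0, 1.8)]
speakers = find_dynamic_reflectors(static_map, scan)
```

Here only the reflection at 45 degrees is reported as a candidate speaker; the TV at 90 degrees is ignored because it matches a stored static entry, regardless of whether it is producing sound.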

The sound source separation unit 120 separates the speaker's voice from the sound input from the sound acquisition unit 110. The sound acquisition unit 110 receives a mixture of sounds generated by sources at various locations, and the sound source separation unit 120 extracts the speaker's voice from that mixture. Knowing the speaker's location is important for this. Conventionally, much research and development effort has gone into arranging multiple microphones in series or in parallel to form a microphone array and analyzing the signals input to the array to determine the sound source location. However, this conventional approach requires searching all directions and transforming from the time domain to the frequency domain and back, so tracking a sound source is computationally expensive.

In the present invention, therefore, the sound source location is not determined by analyzing the signals input to the microphone array. Instead, the sound source location tracking unit 142 tracks the sound source location by analyzing the transmit/receive signals of the radar 141 before any sound is input from the sound acquisition unit 110.

The sound source separation unit 120 then separates the speaker's voice from the sound input through the sound acquisition unit 110, based on the tracked sound source location information. Various algorithms may be adopted for sound source separation. As one example, so that active noise cancellation (ANC) can be applied, the sound source separation unit 120 estimates the sound originating at the tracked source location and removes noise by applying ANC to strip the ambient sound from the estimated sound. Of the microphones 111a, 111b, ... in the microphone array 111, the one facing the speaker receives the largest share of the speaker's voice. An important consideration in noise removal is therefore to preserve as much as possible of the voice signal arriving from in front of that microphone while removing noise arriving from other directions.

To this end, the control unit 150 sets the gain of the amplifier connected to the microphone facing the speaker differently from the gains of the amplifiers connected to the other microphones of the microphone array 111 in the sound acquisition unit 110. That is, the gain of the amplifier connected to the microphone corresponding to the sound source location is increased, and the gains of the other amplifiers are reduced. In this way the sound source separation unit 120 can remove noise arriving from other directions while losing as little as possible of the voice signal arriving from in front of that microphone.
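A minimal sketch of this gain assignment follows. It assumes the microphones are evenly spaced around the housing with microphone 0 at 0 degrees, and the gain values 1.0 and 0.1 are arbitrary placeholders; none of these specifics come from the description.

```python
def amplifier_gains(num_mics, speaker_azimuth_deg, high_gain=1.0, low_gain=0.1):
    """Return one gain per amplifier, boosting the mic that faces the speaker.

    Assumption: mic i points at azimuth i * 360 / num_mics degrees.
    """
    if num_mics <= 0:
        raise ValueError("num_mics must be positive")
    spacing = 360.0 / num_mics

    def angular_gap(i):
        # Absolute angle between mic i and the speaker, wrapped to [0, 180].
        return abs((i * spacing - speaker_azimuth_deg + 180.0) % 360.0 - 180.0)

    facing = min(range(num_mics), key=angular_gap)
    return [high_gain if i == facing else low_gain for i in range(num_mics)]


# Six mics at 0, 60, ..., 300 degrees; speaker tracked by radar at 130 degrees.
gains = amplifier_gains(6, 130.0)
```

With six microphones and a speaker at 130 degrees, the microphone at 120 degrees is closest, so only its amplifier receives the high gain. A smoother weighting across neighbouring microphones would be an equally valid reading of the description.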

According to another embodiment, the control unit 150 may use algorithms such as signal-to-noise ratio (SNR) enhancement or microphone array orientation to remove the noise signal while losing as little of the speaker's sound as possible. For example, as described above, the pointing direction of the microphones 111a, 111b, ... of the microphone array 111 may be adjustable, and the control unit 150 may steer the microphones toward the sound source location tracked by the sound source location tracking unit 142.

In addition, when the individual gains of the amplification unit 112 connected to the microphone array 111 are controlled as described above, or when the direction in which the microphone array 111 faces is adjusted, source separation by the sound source separation unit 120 may be omitted, and the speech recognition unit 131 may perform speech recognition with a simple and fast algorithm.

The post-processing unit 130 includes a speech recognition unit 131 and a response processing unit 132. When the speaker's voice separated by the sound source separation unit 120 is input, the speech recognition unit 131 recognizes the meaning of that voice. The response processing unit 132 then causes the service operation corresponding to the recognized meaning to be performed. For example, the service may be controlling an external electronic device, in which case the post-processing unit 130 transmits the corresponding control command to the external electronic device over a wired or wireless link. The external electronic device (not shown) is equipment that operates according to control commands, and may be any of various Internet of Things (IoT) devices such as a smart TV installed in the home, or lighting, heating, or air-conditioning equipment linked to a service providing server 200.

Although not shown in the drawings, the post-processing unit 130 may also output a service provision message through a speaker (not shown). The radar-based speech recognition service device 100 may also forward the service request recognized from the voice to an external service providing server (not shown) over a wired or wireless link, receive a customized service proposal message fed back from that server, and provide the corresponding service.

FIG. 3 is a flowchart illustrating a radar-based speech recognition service method according to an embodiment of the present invention.

Referring to FIG. 3, when the device 100 is powered on, it transmits radar signals into the surroundings and receives the signals reflected back from surrounding objects (S210). According to an embodiment of the present invention, the radar 141 may transmit a plurality of radar signals radially and receive their reflections.

The device 100 analyzes the radar transmit/receive signals to locate the speaker (S220). That is, it calculates the distance to each reflector from the transmit/receive signals and tracks the sound source location according to the calculated distances. The device 100 may store in advance the distances to the static reflectors present in the surroundings when no person is present. A static reflector is a reflector that does not move, for example a TV in the living room, a wall, or a sofa. When the device 100 calculates a distance for a dynamic reflector that differs from the stored distances to the static reflectors, it can determine that dynamic reflector to be a speaker. Even if a TV is switched on and producing sound, the TV is one of the stored static reflectors, so the device 100 need not treat it as a speaker. The device 100 may also determine that there are two or more speakers.

Next, the device 100 may control its components based on the location information of the detected speaker (S230). According to an embodiment, the device 100 may independently control the gain of each of the amplifiers that amplify the sound output from the microphones 111a, 111b, ... of the microphone array 111. That is, the gain of the amplifier connected to the microphone corresponding to the detected sound source location is increased, and the gains of the other amplifiers are reduced. According to another embodiment, the pointing direction of the microphones 111a, 111b, ... may be adjusted so that they face the sound source location.

When preparation for speech recognition according to the speaker's location has been completed through S210 to S230 described above, the device 100 detects whether speech is being input (S240). This allows, for example, the entire wake-up word 'Alexa' spoken by the user to be recognized from its beginning.

If no speech input is detected in S240, the device 100 returns to S210 and continues searching for the sound source location. S210 to S230 may also continue to run while S250 to S270 are performed, so that sound source separation can be carried out quickly even if the speaker moves. In other words, the device can detect and track the movement of a nearby person so that the context of the conversation with that person is maintained.
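
The branching between S240 and the other steps can be summarized as one pass of a control loop. The sketch below is sequential for simplicity, whereas the description allows S210 to S230 to keep running alongside S250 to S270. All function and parameter names are hypothetical placeholders for the units of the device, passed in as callables.

```python
from typing import Callable, Optional


def run_cycle(locate_speaker: Callable[[], float],
              configure_audio: Callable[[float], None],
              voice_detected: Callable[[], bool],
              separate: Callable[[float], str],
              recognize: Callable[[str], str],
              act: Callable[[str], None]) -> Optional[str]:
    """Run one pass of the flow; return the recognized command or None."""
    position = locate_speaker()       # radar-based location search (up to S220)
    configure_audio(position)         # S230: adjust gains or microphone direction
    if not voice_detected():          # S240: no speech, caller repeats the search
        return None
    voice = separate(position)        # S250: location-based source separation
    command = recognize(voice)        # S260: recognize the meaning
    act(command)                      # S270: perform the matching service
    return command


# Stub callables stand in for the real units (assumed for illustration only).
log = []
result = run_cycle(
    locate_speaker=lambda: 2.3,
    configure_audio=lambda p: log.append(("configure", p)),
    voice_detected=lambda: True,
    separate=lambda p: "separated audio",
    recognize=lambda v: "turn on light",
    act=lambda c: log.append(("act", c)),
)
```

Because the audio path is configured before speech is detected, the first syllables of a wake-up word are already captured with the speaker-facing setup in place.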

On the other hand, if speech input is detected in S240, the device 100 separates the sound source based on the detected speaker location (S250). That is, it separates the speaker's voice from the input sound, which arrives as a mixture of sounds produced by sources at various locations. Various algorithms may be adopted for sound source separation. As one example, so that active noise cancellation (ANC) can be applied, the sound generated at the tracked sound source location is estimated, and ANC is applied to remove ambient sound from the estimated sound. Because, among the microphones 111a, 111b, ... of the microphone array 111, the sound input from the microphone facing the speaker contains the largest share of the speaker's voice, an important consideration in noise removal is to preserve the speech signal arriving from the front of that microphone as far as possible while removing noise arriving from other directions. In addition, when the individual gains of the amplification unit 120 connected to the microphone array 111 are controlled, or the direction in which the microphone array 111 faces is adjusted, as described above, the source separation step (S250) may be omitted and speech recognition may be performed in S260 using a simple and fast algorithm.
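
The description leaves the separation algorithm open. As one illustration of how a known speaker location can favor sound from that direction, the sketch below uses delay-and-sum beamforming. This is a swapped-in, well-known technique, not the ANC example mentioned above and not necessarily what the device uses. The microphone coordinates, sample rate, and function names are assumptions.

```python
import math
from typing import List, Tuple


def steering_delays(mic_xy: List[Tuple[float, float]],
                    speaker_xy: Tuple[float, float],
                    sample_rate_hz: float,
                    speed_of_sound: float = 343.0) -> List[int]:
    """Extra arrival delay of the speaker's sound at each microphone, in samples."""
    dists = [math.hypot(x - speaker_xy[0], y - speaker_xy[1]) for x, y in mic_xy]
    nearest = min(dists)
    return [round((d - nearest) / speed_of_sound * sample_rate_hz) for d in dists]


def delay_and_sum(channels: List[List[float]], delays: List[int]) -> List[float]:
    """Advance each channel by its delay and average, aligning the speaker's sound."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for samples, delay in zip(channels, delays):
            idx = t + delay
            if 0 <= idx < n:          # samples shifted past the end count as zero
                acc += samples[idx]
        out.append(acc / len(channels))
    return out


# Two microphones 0.343 m apart; speaker 1 m from the first one (assumed geometry).
delays = steering_delays([(0.0, 0.0), (0.343, 0.0)], (-1.0, 0.0), 1000.0)
# The same pulse reaches the second microphone one sample later.
channels = [[0.0, 1.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0, 0.0]]
aligned = delay_and_sum(channels, delays)
unaligned = delay_and_sum(channels, [0, 0])
```

With the correct delays the pulse adds coherently to full amplitude, while without alignment it is smeared across two samples at half amplitude. Sound from other directions is misaligned in the same way and is therefore attenuated.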

The device 100 recognizes the meaning of the separated speaker's voice (S260). It then causes a service operation corresponding to the recognized meaning to be performed (S270). For example, the service corresponding to the meaning of the speech may be control of an external electronic device, in which case the device transmits the corresponding control command to the external electronic device through wired or wireless communication.

Claims

A radar-based speech recognition service device comprising:
a radar;
a sound source location tracking unit that analyzes transmit/receive signals of the radar to locate a speaker;
a sound acquisition unit that receives ambient sound;
a sound source separation unit that separates the speaker's voice from the sound input through the sound acquisition unit, based on the speaker location information tracked by the sound source location tracking unit; and
a post-processing unit that recognizes the speaker's voice separated by the sound source separation unit.

The radar-based speech recognition service device according to claim 1, wherein the sound acquisition unit comprises:
a microphone array consisting of a plurality of microphones arranged radially; and
a plurality of amplification units that amplify the sound signals output from the respective microphones and output them to the sound source separation unit,
the device further comprising a control unit that differentially adjusts the gain of each of the plurality of amplification units based on the speaker location information found by the sound source location tracking unit.

The radar-based speech recognition service device according to claim 1, wherein the sound source location tracking unit stores in advance radar transmit/receive signal information for at least one static reflector in the environment, and determines the speaker's position when the analyzed radar transmit/receive signal does not match the stored radar transmit/receive signal of the static reflector.

A radar-based speech recognition service method comprising the steps of:
analyzing radar transmit/receive signals to track a speaker's location;
separating the speaker's voice from input sound based on the tracked speaker location information; and
recognizing the separated speaker's voice.

The radar-based speech recognition service method according to claim 4, further comprising the step of differentially adjusting, based on the tracked speaker location information, the gain of each of a plurality of amplification units that amplify the sound signals output from the respective microphones.