KR20210029595A

KR20210029595A - Keyword Spotting Apparatus, Method and Computer Readable Recording Medium Thereof

Info

Publication number: KR20210029595A
Application number: KR1020190111046A
Authority: KR
Inventors: 안상일; 최승우; 서석준; 신범준
Original assignee: 주식회사 하이퍼커넥트
Priority date: 2019-09-06
Filing date: 2019-09-06
Publication date: 2021-03-16
Also published as: KR102374525B1

Abstract

Disclosed are a keyword spotting apparatus, a keyword spotting method to quickly extract a voice keyword with high precision, and a computer-readable recording medium thereof. According to one embodiment of the present invention, a keyword spotting method using an artificial neural network comprises the following steps: obtaining an input feature map from an input voice; performing a first convolution operation with the input feature map with respect to each of n different filters having a width of w1 while having the same channel length as that of the input feature map, wherein the length of w1 is smaller than a width of the input feature map; performing a second convolution operation with a result of the first convolution operation with respect to each of the different filters having the same channel length as that of the input feature map; storing a result of the previous convolution operation as an output feature map; and extracting a voice keyword by applying the output feature map to a learned machine learning model.

Description

Keyword Spotting Apparatus, Method and Computer Readable Recording Medium Thereof

The present invention relates to a keyword spotting apparatus, a keyword spotting method, and a computer-readable recording medium, and more specifically to a keyword spotting apparatus, method, and computer-readable recording medium capable of spotting keywords very quickly while maintaining high accuracy.

Keyword spotting (KWS) plays a very important role in speech-based user interaction on smart devices. Recent advances in deep learning have led to the application of convolutional neural networks (CNNs) to keyword spotting because of their accuracy and robustness.

The most important challenge facing keyword spotting systems is resolving the trade-off between high accuracy and low latency. This has become a critical issue since traditional convolution-based keyword spotting approaches were found to require a very large amount of computation to reach an adequate level of performance.

Nevertheless, there has been little research on the actual latency of keyword spotting models on mobile devices.

An object of the present invention is to provide a keyword spotting apparatus, method, and computer-readable recording medium capable of extracting voice keywords quickly and with high accuracy.

A keyword spotting method according to an embodiment of the present invention is a keyword spotting method using an artificial neural network, and includes: obtaining an input feature map from an input voice; performing a first convolution operation with the input feature map for each of n different filters that have the same channel length as the input feature map and a width of w1, where w1 is smaller than the width of the input feature map; performing a second convolution operation with the result of the first convolution operation for each of different filters having the same channel length as the input feature map; storing the result of the preceding convolution operation as an output feature map; and extracting a voice keyword by applying the output feature map to a trained machine learning model.

In addition, the stride value of the first convolution operation may be 1.

In addition, performing the second convolution operation may include: performing a first sub-convolution operation between m different filters having a width of w2 and the result of the first convolution operation; performing a second sub-convolution operation between m different filters having a width of w2 and the result of the first sub-convolution operation; performing a third sub-convolution operation between m different filters having a width of 1 and the result of the first convolution operation; and summing the results of the second and third sub-convolution operations.

In addition, the stride values of the first to third sub-convolution operations may be 2, 1, and 2, respectively.
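As a rough check of why these stride values fit together, the sketch below tracks the width (time) dimension through the two branches of the second convolution operation. It assumes "same"-style padding, so that a stride-s convolution maps a width of t to ceil(t / s). The padding scheme, the function names, and the example width of 98 are illustrative assumptions and are not taken from the disclosure.

```python
import math

def out_width(width: int, stride: int) -> int:
    """Output width of a 1D convolution under assumed 'same' padding."""
    return math.ceil(width / stride)

def second_conv_widths(width: int, strides=(2, 1, 2)):
    """Return (main_branch_width, shortcut_width) for the block.

    Main branch: first sub-convolution, then second sub-convolution.
    Shortcut:    third sub-convolution (filter width 1) on the block input.
    """
    s1, s2, s3 = strides
    main = out_width(out_width(width, s1), s2)
    shortcut = out_width(width, s3)
    return main, shortcut

main, shortcut = second_conv_widths(98)  # 98 is a hypothetical input width
print(main, shortcut)  # 49 49
```

Under these assumptions both branches end at the same width, which is what allows the results of the second and third sub-convolution operations to be summed element by element.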

The method may further include performing a third convolution operation, and performing the third convolution operation may include: performing a fourth sub-convolution operation between l different filters having a width of w2 and the result of the second convolution operation; performing a fifth sub-convolution operation between l different filters having a width of w2 and the result of the fourth sub-convolution operation; performing a sixth sub-convolution operation between l different filters having a width of 1 and the result of the second convolution operation; and summing the results of the fifth and sixth sub-convolution operations.

The method may further include performing a fourth convolution operation, and performing the fourth convolution operation may include: performing a seventh sub-convolution operation between m different filters having a width of w2 and the result of the second convolution operation; performing an eighth sub-convolution operation between m different filters having a width of w2 and the result of the seventh sub-convolution operation; and summing the result of the second convolution operation and the result of the eighth sub-convolution operation.

In addition, the stride value of each of the seventh and eighth sub-convolution operations may be 1.
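The fourth convolution operation adds its input directly to the output of two stride-1 sub-convolutions. The following minimal sketch shows that pattern on a single-channel signal. The single channel, the odd kernel width, the zero padding, the absence of any activation or normalization, and all function names are simplifying assumptions made for illustration only.

```python
def conv1d_same(x, kernel):
    """Single-channel 1D convolution (cross-correlation), stride 1, zero 'same' padding."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]

def identity_block(x, kernel_a, kernel_b):
    """Two stride-1 convolutions followed by an element-wise sum with the input."""
    h = conv1d_same(x, kernel_a)   # stands in for the seventh sub-convolution
    h = conv1d_same(h, kernel_b)   # stands in for the eighth sub-convolution
    return [a + b for a, b in zip(x, h)]

x = [1.0, 2.0, 3.0, 4.0]
passthrough = [0.0, 1.0, 0.0]      # kernel that returns its input unchanged
print(identity_block(x, passthrough, passthrough))  # [2.0, 4.0, 6.0, 8.0]
```

Because a stride of 1 with this padding preserves the width, the block input and the output of the second sub-convolution have matching lengths and can be summed without a width-1 convolution on the shortcut path.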

In addition, in obtaining the input feature map, an input feature map of size t×1×f (width×height×channel) may be obtained from the result of mel-frequency cepstral coefficient (MFCC) processing of the input voice, where t denotes time and f denotes frequency.

Meanwhile, a computer-readable recording medium on which a program for performing the keyword spotting method according to the present invention is recorded may be provided.

Meanwhile, a keyword spotting apparatus according to an embodiment of the present invention is an apparatus for spotting keywords using an artificial neural network, and includes a memory in which at least one program is stored and a processor that extracts a voice keyword using the artificial neural network by executing the at least one program. The processor obtains an input feature map from an input voice; performs a first convolution operation with the input feature map for each of n different filters that have the same channel length as the input feature map and a width of w1, where w1 is smaller than the width of the input feature map; performs a second convolution operation with the result of the first convolution operation for each of different filters having the same channel length as the input feature map; stores the result of the preceding convolution operation as an output feature map; and extracts the voice keyword by applying the output feature map to a trained machine learning model.

In addition, the stride value of the first convolution operation may be 1.

In addition, in performing the second convolution operation, the processor may perform a first sub-convolution operation between m different filters having a width of w2 and the result of the first convolution operation, perform a second sub-convolution operation between m different filters having a width of w2 and the result of the first sub-convolution operation, perform a third sub-convolution operation between m different filters having a width of 1 and the result of the first convolution operation, and sum the results of the second and third sub-convolution operations.

In addition, the processor may further perform a third convolution operation. In performing the third convolution operation, the processor may perform a fourth sub-convolution operation between l different filters having a width of w3 and the result of the second convolution operation, perform a fifth sub-convolution operation between l different filters having a width of w3 and the result of the fourth sub-convolution operation, perform a sixth sub-convolution operation between l different filters having a width of 1 and the result of the second convolution operation, and sum the results of the fifth and sixth sub-convolution operations.

In addition, the processor may further perform a fourth convolution operation. In performing the fourth convolution operation, the processor may perform a seventh sub-convolution operation between m different filters having a width of w2 and the result of the second convolution operation, perform an eighth sub-convolution operation between m different filters having a width of w2 and the result of the seventh sub-convolution operation, and sum the result of the second convolution operation and the result of the eighth sub-convolution operation.

In addition, the processor may obtain the input feature map of size t×1×f (width×height×channel) from the result of MFCC processing of the input voice, where t denotes time and f denotes frequency.

The present invention can provide a keyword spotting apparatus, method, and computer-readable recording medium capable of extracting voice keywords quickly and with high accuracy.

FIG. 1 is a block diagram illustrating a convolutional neural network according to an embodiment.
FIG. 2 is a diagram schematically illustrating a convolution operation method according to the prior art.
FIG. 3 is a diagram schematically illustrating a convolution operation method according to an embodiment of the present invention.
FIG. 4 is a flowchart schematically illustrating a keyword spotting method according to an embodiment of the present invention.
FIG. 5 is a flowchart schematically illustrating a convolution operation process according to an embodiment of the present invention.
FIG. 6 is a block diagram schematically illustrating the configuration of a first convolution block according to an embodiment of the present invention.
FIG. 7 is a flowchart schematically illustrating a keyword spotting method according to another embodiment of the present invention.
FIG. 8 is a flowchart schematically illustrating a keyword spotting method according to still another embodiment of the present invention.
FIG. 9 is a block diagram schematically illustrating the configuration of a second convolution block according to an embodiment of the present invention.
FIG. 10 is a flowchart schematically illustrating a keyword spotting method according to still another embodiment of the present invention.
FIG. 11 is a flowchart schematically illustrating a keyword spotting method according to still another embodiment of the present invention.
FIG. 12 is a performance comparison table for explaining the effect of the keyword spotting method according to the present invention.
FIG. 13 is a diagram schematically illustrating the configuration of a keyword spotting apparatus according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become apparent with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Although terms such as "first" and "second" are used to describe various elements, these elements are not limited by such terms. Such terms are used only to distinguish one element from another. Accordingly, a first element mentioned below may be a second element within the technical idea of the present invention.

The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, the singular form also includes the plural unless specifically stated otherwise. As used herein, "comprises" or "comprising" means that the stated element or step does not exclude the presence or addition of one or more other elements or steps.

Unless otherwise defined, all terms used in this specification may be interpreted with the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless explicitly and specifically defined.

FIG. 1 is a block diagram illustrating a convolutional neural network according to an embodiment.

A convolutional neural network (CNN) is a type of artificial neural network (ANN) and may be used mainly to extract features from matrix data or image data. A convolutional neural network may be an algorithm that learns features from historical data.

In the convolutional neural network, a processor may obtain features by applying filters to an input image 110 through a first convolutional layer 120. The processor may reduce the size of the filtered image by subsampling it through a first pooling layer 130. The processor may then filter the image again to extract features and subsample the filtered image to reduce its size through a second convolutional layer 140 and a second pooling layer 150. Thereafter, the processor may obtain output data 170 by fully connecting the processed image through a hidden layer 160.

In a convolutional neural network, the convolutional layers 120 and 140 may perform a convolution operation between input activation data, which is three-dimensional input data, and weight data, which is four-dimensional data representing learnable parameters, to obtain output activation data, which is three-dimensional output data. The obtained output activation data may then be used as the input activation data of the next layer.

Meanwhile, because computing a single pixel of the output activation data requires thousands of multiplication and addition operations, most of the data processing time in a convolutional neural network is spent in the convolutional layers.
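To give a sense of scale for "thousands of operations per output pixel", the short sketch below counts the multiply-accumulate operations needed for one output value of a standard convolutional layer. The 3×3 kernel and the 256 input channels are hypothetical example values, not figures from the disclosure.

```python
def macs_per_output_value(kernel_h: int, kernel_w: int, in_channels: int) -> int:
    """Multiply-accumulate operations needed to produce one output value."""
    return kernel_h * kernel_w * in_channels

# Hypothetical layer: 3x3 kernel applied over 256 input channels.
print(macs_per_output_value(3, 3, 256))  # 2304
```

This count is repeated for every spatial position and every output channel, which is why the convolutional layers dominate the runtime.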

FIG. 2 is a diagram schematically illustrating a convolution operation method according to the prior art.

FIG. 2(a) shows an example of an input voice that is subject to voice keyword extraction. The horizontal axis represents time and the vertical axis represents a feature; given that the input voice is a speech signal, the feature can be understood as frequency. For convenience of explanation, FIG. 2(a) shows features, or frequencies, of constant magnitude over time, but it should be understood that the frequency content of speech changes over time. A speech signal in the general case can therefore be understood as a signal whose frequency magnitude varies along the time axis.

Meanwhile, the graph shown in FIG. 2(a) may be the result of mel-frequency cepstral coefficient (MFCC) processing of the input voice. MFCC is a technique for extracting sound features: the input sound is divided into short time intervals, and features are extracted through spectral analysis of each interval. The input data I represented by the graph can be expressed as follows.

FIG. 2(b) shows an input feature map generated from the input voice shown in FIG. 2(a), together with a convolution filter K1. The input feature map may be generated from the MFCC data shown in FIG. 2(a) and, as shown in FIG. 2(b), may have a size of t×f×1 in width (w) × height (h) × channel (c) order. The input data X_2d corresponding to the input feature map can be expressed as follows.

Because the channel-direction length of the input feature map is 1, it can be understood as two-dimensional data of size t×f. The filter K1 may perform the convolution operation while moving in the width and height directions over the t×f plane of the input feature map. Assuming that the size of the filter K1 is 3×3, the weight data W_2d corresponding to the filter K1 can be expressed as follows.

When the convolution operation is performed between the input data X_2d and the weight data W_2d, the number of operations can be expressed as 3×3×1×f×t. If the number of filters K1 is c, the number of operations can be expressed as 3×3×1×f×t×c.

FIG. 2(c) shows an example of the output feature map obtained as a result of the convolution operation between the input data X_2d and the weight data W_2d. Because the number of filters K1 was assumed to be c, the channel-direction length of the output feature map is c. Accordingly, the data corresponding to the output feature map can be expressed as follows.
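As a compact summary of the sizes stated for FIG. 2, the data could be written as below. This is a reconstruction from the dimensions given in the text (a t×f×1 input feature map, c filters of size 3×3, an output with c channels, and an operation count that implies t×f output positions). The symbol Y_{2d} and the axis ordering of W_{2d} are illustrative choices, not the original notation.

```latex
\begin{aligned}
I &\in \mathbb{R}^{t \times f}, \\
X_{2d} &\in \mathbb{R}^{t \times f \times 1}, \\
W_{2d} &\in \mathbb{R}^{3 \times 3 \times 1 \times c}, \\
Y_{2d} &\in \mathbb{R}^{t \times f \times c}, \\
\text{operations} &= 3 \times 3 \times 1 \times f \times t \times c .
\end{aligned}
```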

A convolutional neural network is known as a neural network that performs successive transformations from low level to high level. Because modern convolutional neural networks use small filters, a relatively shallow network may have difficulty extracting useful features from both low and high frequencies.

When image analysis is performed with a convolutional neural network, adjacent pixels among the pixels constituting an image are likely to have similar characteristics, while pixels far apart are relatively likely to have different characteristics. A convolutional neural network can therefore be a good tool for image analysis and learning.

On the other hand, applying a conventional convolutional neural network designed for image analysis and learning to speech analysis and learning may be inefficient. This is because the frequency magnitude of a speech signal changes over time, and even temporally adjacent portions of a speech signal may differ greatly in frequency magnitude.

FIG. 3 is a diagram schematically illustrating a convolution operation method according to an embodiment of the present invention.

FIG. 3(a) shows an input feature map and a filter K according to an embodiment of the present invention. The input feature map shown in FIG. 3(a) has substantially the same size as the input feature map of FIG. 2(b), but its height is 1 and its channel-direction length is defined as f. The channel-direction length of the filter K according to the present invention is the same as that of the input feature map. Therefore, a single filter K can cover all the frequency data contained in the input feature map.

Meanwhile, the input data X_1d and the weight data W_1d corresponding to the input feature map and the filter K of FIG. 3(a) can each be expressed as follows.

Because the input feature map and the filter K have the same channel-direction length, the filter K only needs to move in the width (w) direction while performing the convolution operation. Unlike the case described with reference to FIG. 2, the convolution operation according to the present invention can therefore be understood as a one-dimensional operation.
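The one-dimensional behavior can be sketched in plain Python as follows. The input is stored as t frames of f channel values, and a single filter of width k spans all f channels, so the only loop over positions runs along the width (time) axis. The function name, the absence of padding, and the small example values are assumptions made for illustration.

```python
def temporal_conv(x, w, stride=1):
    """Slide one filter along the width (time) axis only.

    x: input feature map as a list of t frames, each a list of f channel values.
    w: one filter as a list of k taps, each a list of f channel weights.
    Returns a list of output values, one per filter position (no padding).
    """
    t, k, f = len(x), len(w), len(x[0])
    assert all(len(tap) == f for tap in w), "filter must span all f channels"
    out = []
    for start in range(0, t - k + 1, stride):
        acc = 0.0
        for j in range(k):            # position within the filter width
            for ch in range(f):       # every channel is covered at once
                acc += w[j][ch] * x[start + j][ch]
        out.append(acc)
    return out

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # t = 4, f = 2
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]               # k = 3, spans f = 2
print(temporal_conv(x, w))  # [16.0, 24.0]
```

Each filter produces a one-dimensional sequence over time; applying several such filters yields one output channel per filter.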

Meanwhile, when the input feature map has the same form as the input feature map of FIG. 2(b), the filter K may be defined with a channel-direction length of 1 and a height of f. The filter K according to the present invention is therefore not limited by its channel-direction length. Because either the height or the channel-direction length of the input feature map may be 1, the filter K according to the present invention is characterized in that its length in every direction other than the width direction matches the length of the input feature map in the corresponding direction.

When the convolution operation is performed between the input data X_1d and the weight data W_1d, the number of operations can be expressed as 3×1×f×t×1. If the number of filters K is c′, the number of operations can be expressed as 3×1×f×t×1×c′.

FIG. 3(b) shows an example of the output feature map obtained as a result of the convolution operation between the input data X_1d and the weight data W_1d. Because the number of filters K was assumed to be c′, the channel-direction length of the output feature map is c′. Accordingly, the data corresponding to the output feature map can be expressed as follows.
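The sizes stated for FIG. 3 could be summarized as below. This is a reconstruction from the dimensions given in the text (a t×1×f input feature map, c′ filters of width 3 that span all f channels, and an output with c′ channels). The symbol Y_{1d} and the axis ordering of W_{1d} are illustrative choices, not the original notation.

```latex
\begin{aligned}
X_{1d} &\in \mathbb{R}^{t \times 1 \times f}, \\
W_{1d} &\in \mathbb{R}^{3 \times 1 \times f \times c'}, \\
Y_{1d} &\in \mathbb{R}^{t \times 1 \times c'}, \\
\text{operations} &= 3 \times 1 \times f \times t \times 1 \times c' .
\end{aligned}
```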

Comparing the size of the filter K1 used in the prior art with that of the filter K according to the present invention, the filter K is larger, and a single filter K also covers more data per convolution operation. Accordingly, the convolution operation according to the present invention can be performed with fewer filters than in the prior art.

That is, because the relationship c > c′ holds, the total number of convolution operations is further reduced when the convolution operation method according to the present invention is used.
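The two operation counts given in the description can be compared numerically. The sketch below evaluates 3×3×1×f×t×c for the two-dimensional case and 3×1×f×t×1×c′ for the one-dimensional case. The values of t, f, c, and c′ are hypothetical and chosen only so that c > c′ holds.

```python
def ops_2d(t, f, c, k=3):
    """k x k filter, 1 input channel, c filters, t x f output positions."""
    return k * k * 1 * f * t * c

def ops_1d(t, f, c_prime, k=3):
    """Width-k filter spanning all f channels, c_prime filters, t output positions."""
    return k * 1 * f * t * 1 * c_prime

t, f = 98, 40          # hypothetical: 98 frames, 40 coefficients per frame
c, c_prime = 64, 16    # hypothetical filter counts with c > c_prime
print(ops_2d(t, f, c))                            # 2257920
print(ops_1d(t, f, c_prime))                      # 188160
print(ops_2d(t, f, c) / ops_1d(t, f, c_prime))    # 12.0
```

With these formulas the ratio simplifies to 3c / c′, so the saving grows as the one-dimensional layer gets by with fewer filters.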

FIG. 4 is a flowchart schematically illustrating a keyword spotting method according to an embodiment of the present invention.

Referring to FIG. 4, a keyword spotting method according to an embodiment of the present invention includes obtaining an input feature map from an input voice (S110), performing a first convolution operation (S120), performing a second convolution operation (S130), storing an output feature map (S140), and extracting a voice keyword (S150).

The keyword spotting method according to the present invention is a keyword spotting method using an artificial neural network. In step S110, an input feature map is obtained from an input voice. Voice keyword extraction aims to extract predefined keywords from an audio signal stream. For example, the keyword spotting method may be used to identify a keyword such as "Hey Siri" or "Okay Google" in a human voice and use it as a wake-up phrase for a device.

In step S110, an input feature map having a size of t×1×f (width×height×channel) may be obtained from the result of MFCC (Mel Frequency Cepstral Coefficient) processing of the input voice. Here, t denotes time and f denotes frequency.

The input feature map may represent the input voice acquired during a predetermined time (t) as frequency components over time (t). Each frequency value in the input feature map may be expressed as a floating-point number.
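The t×1×f layout described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `to_input_feature_map` and the sizes (98 frames, 40 coefficients) are assumptions chosen for the example, and the MFCC values are random placeholders rather than the output of a real MFCC front end.

```python
import numpy as np

def to_input_feature_map(mfcc: np.ndarray) -> np.ndarray:
    """Reshape an MFCC matrix of shape (t, f) into a t x 1 x f feature map."""
    if mfcc.ndim != 2:
        raise ValueError("expected a 2-D array of shape (t, f)")
    t, f = mfcc.shape
    # width = time axis, height = 1, channel = frequency (coefficient) axis
    return mfcc.astype(np.float32).reshape(t, 1, f)

# Hypothetical sizes: 98 time frames, 40 coefficients per frame.
mfcc = np.random.default_rng(0).standard_normal((98, 40))
x = to_input_feature_map(mfcc)
print(x.shape, x.dtype)  # (98, 1, 40) float32
```

Placing frequency on the channel axis is what lets a filter of width w1 and channel length f cover every frequency component of w1 consecutive frames at once.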

In step S120, a first convolution operation with the input feature map is performed for each of n different filters having the same channel length as the input feature map. Each of the n different filters may discriminate a different sound. For example, a first filter may discriminate a sound corresponding to 'a', and a second filter may discriminate a sound corresponding to 'o'. Through the convolution operations with the n different filters, the characteristics of the sounds that match each filter may be further enhanced.

The width of the n filters may be defined as w1. If the width of the input feature map is w, the relationship w > w1 may hold. For example, in FIG. 3(a), the width of the input feature map is t and the width of the filter (K1) may be 3. When the number of filters in the first convolution operation is n, a total of n outputs are produced. The convolution operation may be performed by a processor such as a CPU, and the result of the convolution operation may be stored in a memory or the like.

In the first convolution operation performed in step S120, a stride value of 1 may be applied. The stride is the interval by which the filter steps across the input data in a convolution operation. When a stride value of 1 is applied, the convolution operation is performed on all of the input data without skipping any of it.
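The first convolution operation described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the height dimension of 1 is dropped so the input is a (t, f) array, no padding is applied, and the function name `temporal_conv` and the sizes (t = 98, f = 40, n = 16, w1 = 3) are chosen only for the example.

```python
import numpy as np

def temporal_conv(x: np.ndarray, filters: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slide n filters of shape (w1, c) along the time axis of x with shape (t, c).

    Returns an array of shape (t_out, n): one output channel per filter.
    """
    t, c_in = x.shape
    n, w1, c_f = filters.shape
    if c_f != c_in:
        raise ValueError("filter channel length must equal input channel length")
    if w1 > t:
        raise ValueError("filter width must not exceed input width")
    t_out = (t - w1) // stride + 1
    out = np.empty((t_out, n))
    for i in range(t_out):
        window = x[i * stride : i * stride + w1]  # shape (w1, c_in)
        # Each filter spans the whole channel axis, so one dot product per filter.
        out[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((98, 40))          # hypothetical t = 98, f = 40
filters = rng.standard_normal((16, 3, 40))  # n = 16 filters of width w1 = 3
y = temporal_conv(x, filters, stride=1)
print(y.shape)  # (96, 16)
```

With a stride of 1 the filter visits every time position, so only w1 - 1 positions are lost at the boundary when no padding is used.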

In step S130, a second convolution operation with the result of the first convolution operation is performed for each of a set of different filters having the same channel length as the input feature map. The number of filters used in the second convolution operation may be m, and the width of the filters may be defined as w2. The width of the filters used in the second convolution operation and the width of the filters used in the first convolution operation may be the same or different.

However, the length in the channel direction of the filters used in the second convolution operation, like that of the filters used in the first convolution operation, is equal to the length of the input feature map in the channel direction.

Like the n different filters used in the first convolution operation, each of the m different filters used in the second convolution operation may discriminate a different sound.

In step S140, the result of the preceding convolution operation is stored as an output feature map. The output feature map is the result finally obtained before the voice keyword is extracted in step S150, and may have the form shown in FIG. 3(b). The output feature map stored in step S140 may be understood as the result of the convolution operation performed in the immediately preceding step. Accordingly, in step S140, the result of the second convolution operation performed in step S130 may be stored as the output feature map.

In step S150, a voice keyword is extracted by applying the output feature map to a trained machine learning model. The machine learning model may include a pooling layer, a fully connected (full-connect) layer, a softmax operation, and the like, which are described in detail with reference to the following drawings.

A pooling operation may be performed following the convolution operations described in this specification, and a zero padding technique may be applied to prevent the reduction in data that may occur during the pooling operation.

FIG. 5 is a flowchart schematically illustrating a convolution operation process according to an embodiment of the present invention.

Referring to FIG. 5, the step of performing the first convolution operation (S120) and the step of performing the second convolution operation (S130) are shown in more detail. In the step in which the first convolution operation is performed, a convolution operation is performed between the input feature map and a filter. In the embodiment shown in FIG. 5, the filter has a size of 3×1, meaning a filter with a width of 3 and a height of 1. In addition, as described with reference to FIG. 4, the length of the filter in the channel direction is equal to the length of the input feature map in the channel direction.

A stride value of 1 is applied to the first convolution operation, and the number of filters may be 16k. Here, k is a multiplier for the number of channels. When k is 1, convolution operations are performed between a total of 16 different filters and the input feature map. That is, when k is 1, n is 16.

In the step in which the second convolution operation is performed, a convolution operation may be performed on the result of the first convolution operation using a first convolution block. The second convolution operation may use filters having a width different from that of the filters used in the first convolution operation. In the embodiment shown in FIG. 5, a stride value of 2 may be applied to the second convolution operation. Whereas 16k different filters are used in the first convolution operation, 24k different filters may be used in the second convolution operation. The second convolution operation therefore uses 1.5 times as many filters as the first convolution operation. That is, when k is 1, m is 24.

After the second convolution operation is performed, the final result may be extracted by passing the data through a pooling layer, a fully connected layer, and a softmax operation.
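The pooling, fully connected, and softmax stages can be sketched as follows. The text does not specify the type of pooling, so average pooling over the time axis is an assumption here, as are the function name `classify`, the random weights, and the sizes (12 time steps, 48 channels, 12 keyword classes).

```python
import numpy as np

def classify(feature_map: np.ndarray, fc_weight: np.ndarray, fc_bias: np.ndarray) -> np.ndarray:
    """Map an output feature map of shape (t, c) to keyword probabilities."""
    pooled = feature_map.mean(axis=0)         # assumed average pooling -> shape (c,)
    logits = fc_weight @ pooled + fc_bias     # fully connected layer -> shape (num_keywords,)
    shifted = logits - logits.max()           # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()                    # softmax

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((12, 48))   # hypothetical output feature map
fc_weight = rng.standard_normal((12, 48))     # hypothetical: 12 keyword classes
fc_bias = np.zeros(12)
probs = classify(feature_map, fc_weight, fc_bias)
print(probs.shape, int(probs.argmax()))
```

The index of the largest probability would then be taken as the detected keyword class.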

A single model consisting of the first and second convolution operations, pooling, the fully connected layer, and softmax of FIG. 5 may be understood as an artificial neural network model according to an embodiment of the present invention.

FIG. 6 is a block diagram schematically illustrating the configuration of a first convolution block according to an embodiment of the present invention.

The first convolution block is the convolution block used in the second convolution operation described with reference to FIGS. 4 and 5. The second convolution operation (S130) using the first convolution block includes performing a first sub-convolution operation between m different filters having a width of w2 and the result of the first convolution operation (S131). The width of the filters used here may differ from the width of the filters used in the first convolution operation or the second convolution operation. In the embodiment illustrated in FIG. 6, the width of the filters used in step S131, that is, the value of w2, may be 9. In addition, as described with reference to FIG. 5, the number of filters used in step S131 is 16k, and a total of 16k convolution operations may be performed. In step S131, a stride value of 2 may be applied.

After step S131 is performed, batch normalization and an activation function may be applied to the convolution result. FIG. 6 discloses an embodiment that uses a ReLU (Rectified Linear Unit) as the activation function. Batch normalization may be applied to address the vanishing gradient or exploding gradient problem that can arise as convolution operations are repeated. The activation function may likewise be applied to address the vanishing or exploding gradient problem, and may also serve as a means of addressing overfitting. Various kinds of activation functions exist, and the present invention discloses, as one embodiment, a method of applying the ReLU function. Once the convolution data has been normalized by batch normalization, the ReLU function may be applied to the normalized data.

Batch normalization and the activation function are used to address the vanishing (or exploding) gradient and overfitting problems that can arise while passing through multiple convolution layers. If other means of addressing these problems exist, the embodiments of the present invention may be modified in various ways.

When batch normalization and the activation function have been applied, a second sub-convolution operation may be performed (S132). In step S132, a second sub-convolution operation is performed between m different filters having a width of w2 and the result of the first sub-convolution operation. The width of the filters used here may differ from the width of the filters used in the first convolution operation or the second convolution operation. In the embodiment illustrated in FIG. 6, the width of the filters used in step S132 may be 9. In addition, as described with reference to FIG. 5, the number of filters used in step S132 is 16k, and a total of 16k convolution operations may be performed. In step S132, a stride value of 1 may be applied. After the second sub-convolution operation is performed in step S132, batch normalization may be performed.

In step S133, a third sub-convolution operation may be performed. The third sub-convolution operation is a convolution operation between m different filters having a width of 1 and the result of the convolution operation performed in step S120. A stride value of 2 may be applied here. After the third sub-convolution operation is performed in step S133, batch normalization and the activation function may be applied.

Step S130 may further include summing the result of the second sub-convolution operation performed in step S132 and the result of the third sub-convolution operation performed in step S133. As illustrated in FIG. 6, the results of the second and third sub-convolution operations are summed, the activation function is finally applied to the summed result, and the data may then be passed to the next stage, for example a pooling layer.
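The data flow of the first convolution block (steps S131 to S133 followed by the summation) can be sketched as below. Several points are assumptions made for the sake of a runnable example: zero padding of (width // 2) on each side is used so that the two branches produce the same length, `batch_norm` is a simplified stand-in that normalizes each channel over time without learned parameters, the filter values are random, and m = 24 follows the 24k filter count of FIG. 5 with k = 1.

```python
import numpy as np

def conv1d(x, filters, stride=1, pad=0):
    """x: (t, c_in), filters: (n, w, c_in) -> (t_out, n), with zero padding in time."""
    x = np.pad(x, ((pad, pad), (0, 0)))
    n, w, _ = filters.shape
    t_out = (x.shape[0] - w) // stride + 1
    return np.stack([
        np.tensordot(filters, x[i * stride : i * stride + w], axes=([1, 2], [0, 1]))
        for i in range(t_out)
    ])

def batch_norm(x, eps=1e-5):
    # Simplified stand-in: per-channel normalization over time, no learned scale/shift.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def first_conv_block(x, f1, f2, f_short):
    pad = f1.shape[1] // 2                                        # assumed zero padding
    main = relu(batch_norm(conv1d(x, f1, stride=2, pad=pad)))     # S131, then BN and ReLU
    main = batch_norm(conv1d(main, f2, stride=1, pad=pad))        # S132, then BN
    short = relu(batch_norm(conv1d(x, f_short, stride=2)))        # S133 (width 1), then BN and ReLU
    return relu(main + short)                                     # summation, then ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal((96, 16))            # hypothetical result of step S120
f1 = rng.standard_normal((24, 9, 16))        # m = 24 filters of width w2 = 9
f2 = rng.standard_normal((24, 9, 24))
f_short = rng.standard_normal((24, 1, 16))   # width-1 filters on the shortcut path
out = first_conv_block(x, f1, f2, f_short)
print(out.shape)  # (48, 24)
```

The width-1, stride-2 convolution on the shortcut path brings the input to the same length and channel count as the main path, which is what makes the element-wise sum possible.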

FIG. 7 is a flowchart schematically illustrating a keyword spotting method according to another embodiment of the present invention.

Referring to FIG. 7, a keyword spotting method according to another embodiment of the present invention includes performing a first convolution operation (S220), performing a second convolution operation (S230), and performing a third convolution operation (S240). In steps S220 and S230, the first convolution operation and the second convolution operation described with reference to FIGS. 4 to 6 may be performed.

In step S240, the third convolution operation may be performed using a first convolution block having the configuration shown in FIG. 6. Referring to FIGS. 6 and 7 together, the step of performing the third convolution operation (S240) may include performing a fourth sub-convolution operation between l different filters having a width of w2 and the result of the second convolution operation, performing a fifth sub-convolution operation between l different filters having a width of w2 and the result of the fourth sub-convolution operation, performing a sixth sub-convolution operation between l different filters having a width of 1 and the result of the second convolution operation, and summing the results of the fifth and sixth sub-convolution operations.

That is, step S240 may be understood as performing the same type of convolution operation as step S230 once more. However, the number of filters used in step S240 may differ from the number of filters used in step S230. Referring to FIG. 7, 24k filters may be used in step S230 and 32k filters may be used in step S240.

FIG. 8 is a flowchart schematically illustrating a keyword spotting method according to yet another embodiment of the present invention.

Referring to FIG. 8, a keyword spotting method according to yet another embodiment of the present invention may include performing a first convolution operation (S320), performing a second convolution operation (S330), and performing a fourth convolution operation (S340). In step S340, the fourth convolution operation may be performed using a second convolution block.

The number of filters used in step S330 and the number of filters used in step S340 may both be 24k. Only the number of filters is the same, however; the size of the filters and the values of the data contained in the filters (for example, weight activation) may differ.

FIG. 9 is a block diagram schematically illustrating the configuration of a second convolution block according to an embodiment of the present invention.

The second convolution block may be used in the step of performing the fourth convolution operation (S340). Referring to FIG. 9, step S340 may include performing a seventh sub-convolution operation between m different filters having a width of w2 and the result of the second convolution operation (S341), and performing an eighth sub-convolution operation between m different filters having a width of w2 and the result of the seventh sub-convolution operation (S342).

In FIG. 9, the width of the filters used in steps S341 and S342 may be set to 9, and the stride value may be set to 1. In addition, batch normalization may be applied to the result of the seventh sub-convolution operation performed in step S341, after which the ReLU function may be applied as the activation function.

Batch normalization may also be applied to the result of the eighth sub-convolution operation performed in step S342. A step of summing the result of the second convolution operation performed in step S330 and the result of the eighth sub-convolution operation may then be performed.

The result of the second convolution operation performed in step S330 is summed with the result of the eighth sub-convolution operation without any additional convolution operation, and the corresponding path may be defined as an identity shortcut. Unlike in the first convolution block, an identity shortcut can be applied here because the stride values applied to the seventh and eighth sub-convolution operations in the second convolution block are both 1, so no change in dimension occurs during the convolution operations.

The data obtained by applying the ReLU function as the activation function to the summed result may be passed to the next stage, for example a pooling stage.
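The second convolution block with its identity shortcut can be sketched in the same style. As before, this is an illustration under assumptions rather than the embodiment itself: zero padding of (width // 2) keeps the time length unchanged, `batch_norm` is a simplified stand-in without learned parameters, and the filter values and sizes (48 time steps, 24 channels) are hypothetical.

```python
import numpy as np

def conv1d(x, filters, stride=1, pad=0):
    """x: (t, c_in), filters: (n, w, c_in) -> (t_out, n), with zero padding in time."""
    x = np.pad(x, ((pad, pad), (0, 0)))
    n, w, _ = filters.shape
    t_out = (x.shape[0] - w) // stride + 1
    return np.stack([
        np.tensordot(filters, x[i * stride : i * stride + w], axes=([1, 2], [0, 1]))
        for i in range(t_out)
    ])

def batch_norm(x, eps=1e-5):
    # Simplified stand-in: per-channel normalization over time, no learned scale/shift.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def second_conv_block(x, f1, f2):
    pad = f1.shape[1] // 2                                        # assumed zero padding
    main = relu(batch_norm(conv1d(x, f1, stride=1, pad=pad)))     # S341, then BN and ReLU
    main = batch_norm(conv1d(main, f2, stride=1, pad=pad))        # S342, then BN
    # Identity shortcut: the block input is added unchanged, which requires
    # the main path to preserve both the time length and the channel count.
    return relu(main + x)

rng = np.random.default_rng(0)
x = rng.standard_normal((48, 24))        # hypothetical result of step S330
f1 = rng.standard_normal((24, 9, 24))    # m = 24 filters of width 9
f2 = rng.standard_normal((24, 9, 24))
out = second_conv_block(x, f1, f2)
print(out.shape)  # (48, 24)
```

Because both strides are 1 and the filter count equals the input channel count, the output has exactly the shape of the input, so no convolution is needed on the shortcut path.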

FIG. 10 is a flowchart schematically illustrating a keyword spotting method according to yet another embodiment of the present invention.

Referring to FIG. 10, a keyword spotting method according to yet another embodiment of the present invention may include a step in which a first convolution operation such as that of step S120 of FIG. 4 is performed, followed by three consecutive convolution operations that each use the first convolution block.

For the configuration of the first convolution block, refer to FIG. 6. In FIG. 10, 24k different filters may be used in the first of the first convolution blocks, and 32k and 48k different filters may be used in the second and the last of the first convolution blocks, respectively.

The data to which the final ReLU function has been applied in the last first convolution block may be passed to the next stage, for example a pooling layer.
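To make the FIG. 10 pipeline concrete, the sketch below traces the (time length, channel count) of the data after each stage. It rests on assumptions not stated in the text: the first convolution uses a width of 3 without padding, each first convolution block halves the time length (rounding up) because of its stride of 2 with zero padding, and the input length of 98 frames is hypothetical. The function name `trace_shapes` is likewise introduced only for this example.

```python
def trace_shapes(t: int, k: float = 1, first_width: int = 3):
    """Return (time length, channels) after the first convolution and each block."""
    shapes = []
    t = t - first_width + 1            # first convolution: stride 1, no padding (assumed)
    shapes.append((t, int(16 * k)))    # 16k filters
    for base in (24, 32, 48):          # 24k, 32k, 48k filters in the three blocks
        t = (t + 1) // 2               # stride 2 with zero padding -> ceil(t / 2)
        shapes.append((t, int(base * k)))
    return shapes

print(trace_shapes(98))
# [(96, 16), (48, 24), (24, 32), (12, 48)]
```

The time axis shrinks while the channel axis grows, so later blocks operate on shorter but wider feature maps.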

FIG. 11 is a flowchart schematically illustrating a keyword spotting method according to yet another embodiment of the present invention.

Referring to FIG. 11, a keyword spotting method according to yet another embodiment of the present invention includes a step in which a first convolution operation such as that of step S120 of FIG. 4 is performed. The method may further include a process in which a step of performing a convolution operation using the first convolution block, followed by a step of performing a convolution operation using the second convolution block, is repeated three times.

For the configuration of the first convolution block, refer to FIG. 6, and for the configuration of the second convolution block, refer to FIG. 9. In FIG. 11, 24k different filters may be used in the first pair of first and second convolution blocks, 32k different filters in the second pair, and 48k different filters in the last pair.

FIG. 12 is a performance comparison table for explaining the effect of the keyword spotting method according to the present invention.

In the table of FIG. 12, the models CNN-1 through Res15 show the results of using voice keyword extraction models according to conventional methods. Among the conventional voice keyword extraction models, the CNN-2 model showed the best result in terms of extraction time (1.2 ms), and the Res15 model showed the best result in terms of accuracy (95.8%).

The models TC-ResNet8 through TC-ResNet14-1.5 are models to which the keyword spotting method according to the present invention is applied. First, the TC-ResNet8 model uses the method shown in FIG. 10 with a k value of 1. Accordingly, 16 different filters are used in the step in which the first convolution operation is performed, and 24, 32, and 48 different filters are used in the first, second, and last first convolution blocks, respectively.

The TC-ResNet8-1.5 model uses the method shown in FIG. 10 with a k value of 1.5. Accordingly, 24 different filters are used in the step in which the first convolution operation is performed, and 36, 48, and 72 different filters are used in the first, second, and last first convolution blocks, respectively.

The TC-ResNet14 model uses the method shown in FIG. 11 with a k value of 1. Accordingly, 16 different filters are used in the step in which the first convolution operation is performed. Then 24, 24, 32, 32, 48, and 48 different filters are used, in order, in the first, second, first, second, first, and second convolution blocks, respectively.

The TC-ResNet14-1.5 model uses the method shown in FIG. 11 with a k value of 1.5. Accordingly, 24 different filters are used in the step in which the first convolution operation is performed. Then 36, 36, 48, 48, 72, and 72 different filters are used, in order, in the first, second, first, second, first, and second convolution blocks, respectively.
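The filter counts of the four models follow directly from the base counts and the multiplier k, as the short sketch below shows. The dictionary keys and the function name `filter_counts` are introduced only for this illustration; the base counts and k values are those stated above.

```python
# Base filter counts (k = 1): first convolution, then each convolution block in order.
BASE_COUNTS = {
    "fig10": [16, 24, 32, 48],               # first conv + three first convolution blocks
    "fig11": [16, 24, 24, 32, 32, 48, 48],   # first conv + three (first, second) block pairs
}

def filter_counts(layout: str, k: float) -> list[int]:
    """Scale every base filter count by the channel multiplier k."""
    return [int(c * k) for c in BASE_COUNTS[layout]]

print(filter_counts("fig10", 1))    # TC-ResNet8:      [16, 24, 32, 48]
print(filter_counts("fig10", 1.5))  # TC-ResNet8-1.5:  [24, 36, 48, 72]
print(filter_counts("fig11", 1))    # TC-ResNet14:     [16, 24, 24, 32, 32, 48, 48]
print(filter_counts("fig11", 1.5))  # TC-ResNet14-1.5: [24, 36, 36, 48, 48, 72, 72]
```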

Referring to the table shown in FIG. 12, the TC-ResNet8 model recorded a keyword extraction time of 1.1 ms, which is faster than the fastest time (1.2 ms) achieved with a model according to the conventional methods. In addition, the TC-ResNet8 model achieves an accuracy of 96.1%, so it is also superior to the conventional methods in terms of accuracy.

The TC-ResNet14-1.5 model recorded an accuracy of 96.6%, which is the highest among both the models according to the conventional methods and the models according to the present invention.

Taking the accuracy of voice keyword extraction as the most important measure, the keyword extraction time of the Res15 model, the best of the conventional methods, is 424 ms. The keyword extraction time of the TC-ResNet14-1.5 model, which corresponds to an embodiment of the present invention, is 5.7 ms. Accordingly, even with the TC-ResNet14-1.5 model, which has the slowest keyword extraction time among the embodiments of the present invention, voice keywords can be extracted about 74.8 times faster than with the conventional method.

When the TC-ResNet8 model, which has the fastest keyword extraction time among the embodiments of the present invention, is used, voice keywords can be extracted 385 times faster than with the Res15 model.

FIG. 13 is a diagram schematically illustrating the configuration of a keyword spotting apparatus according to an embodiment of the present invention.

Referring to FIG. 13, the keyword spotting apparatus 20 may include a processor 210 and a memory 220. Those of ordinary skill in the art related to the present embodiment will understand that other general-purpose components may be included in addition to the components shown in FIG. 13.

The processor 210 controls the overall operation of the keyword spotting apparatus 20 and may include at least one processor such as a CPU. The processor 210 may include at least one specialized processor corresponding to each function, or may be a single integrated processor.

The memory 220 may store programs, data, or files related to the convolution operations performed in a convolutional neural network. The memory 220 may store instructions executable by the processor 210. The processor 210 may execute a program stored in the memory 220, read data or files stored in the memory 220, or store new data. In addition, the memory 220 may store program instructions, data files, data structures, and the like, alone or in combination.

The processor 210 may include a high-precision operator (e.g., a 32-bit operator) designed in a hierarchical structure so that it comprises a plurality of low-precision operators (e.g., 8-bit operators). In this case, the processor 210 may support instructions for high-precision operations and SIMD (Single Instruction Multiple Data) instructions for low-precision operations. If the bit-width is quantized to fit the input of the low-precision operators, the processor 210 can accelerate the convolution operation by performing, in the same amount of time, a plurality of small-bit-width operations in parallel instead of a single large-bit-width operation. The processor 210 may also accelerate the convolution operation on the convolutional neural network through a predetermined binary operation.
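As an illustration of the quantization step mentioned above, the following is a minimal sketch that assumes symmetric linear quantization to 8-bit integers. The specification does not state which quantization scheme is used, so the scheme and the function names `quantize_int8` and `int8_dot` are hypothetical. The sketch only shows that a dot product, the core of a convolution, can be accumulated in small integers and rescaled once at the end. The parallel execution itself would be done by SIMD hardware and is not modeled here.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to integers in [-127, 127].

    Assumed scheme for illustration only; returns the integers and the scale.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(qa, scale_a, qb, scale_b):
    """Dot product accumulated in integers, rescaled to a float once at the end."""
    acc = sum(a * b for a, b in zip(qa, qb))  # integer-only accumulation
    return acc * scale_a * scale_b


# Hypothetical filter weights and input values.
weights = [0.5, -1.0, 0.25, 0.75]
inputs = [1.0, 2.0, -0.5, 0.0]

qw, sw = quantize_int8(weights)
qx, sx = quantize_int8(inputs)

approx = int8_dot(qw, sw, qx, sx)
exact = sum(w * x for w, x in zip(weights, inputs))
print(exact, approx)  # the two values should agree to within quantization error
```

On a processor with 8-bit SIMD lanes, several of the integer multiplications in `int8_dot` could be issued by one instruction, which is the source of the speed-up the paragraph describes.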

The processor 210 may obtain an input feature map from an input voice. The input feature map may be obtained, with a size of t×1×f (width×height×channel), from the result of MFCC (Mel Frequency Cepstral Coefficient) processing of the input voice.
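The t×1×f layout can be sketched as follows. This is only an illustration of the data arrangement: the MFCC computation itself is not shown, the numeric values are placeholders, and the helper names `to_input_feature_map` and `shape` are hypothetical.

```python
def to_input_feature_map(mfcc_frames):
    """Arrange t frames of f MFCC coefficients as a t x 1 x f feature map.

    Width corresponds to time, height is fixed at 1, and the f coefficients
    of each frame are treated as channels.
    """
    return [[list(frame)] for frame in mfcc_frames]


def shape(feature_map):
    """Return (width, height, channel) of a nested-list feature map."""
    return (len(feature_map), len(feature_map[0]), len(feature_map[0][0]))


# Placeholder values standing in for real MFCC output:
# t = 5 time frames, f = 3 coefficients per frame.
mfcc = [[0.1 * (i + j) for j in range(3)] for i in range(5)]

ifm = to_input_feature_map(mfcc)
print(shape(ifm))  # (5, 1, 3)
```

Because the frequency axis is placed in the channel dimension, a filter with the same channel length as the input sees all f coefficients at once and only slides along the time axis.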

In addition, the processor 210 performs a first convolution operation with the input feature map for each of n different filters having the same channel length as the input feature map. Here, the width w1 of the n different filters may be set smaller than the width of the input feature map.

Each of the n different filters may discriminate a different sound. For example, a first filter may discriminate the sound corresponding to 'a', and a second filter may discriminate the sound corresponding to 'o'. As the input passes through the convolution operations with the n different filters, the features of the sounds corresponding to each filter's characteristics may be further enhanced.

The width of the n filters may be defined as w1; if the width of the input feature map is w, the relationship w > w1 may hold. When n filters are used in the first convolution operation, a total of n outputs are produced, and the result of the convolution operation may be stored in the memory 220.

In the first convolution operation, a stride value of 1 may be applied. The stride is the interval by which the filter skips over the input data in a convolution operation; when a stride value of 1 is applied, the convolution operation is performed on all of the input data without skipping any.
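The first convolution operation described above can be sketched as a one-dimensional convolution along the time axis. This sketch makes several assumptions not stated in the text: no padding is applied, the height-1 dimension is dropped for brevity, and the function name `temporal_conv` and the example sizes are hypothetical.

```python
def temporal_conv(x, filters, stride=1):
    """Convolve along time without padding.

    x:       list of t frames, each a list of c channel values.
    filters: list of n filters, each a list of w rows of c weights,
             so every filter spans the full channel length of x.
    Returns a list of output frames, each holding n values (one per filter).
    """
    t = len(x)
    w = len(filters[0])
    out = []
    for start in range(0, t - w + 1, stride):
        frame = []
        for flt in filters:
            acc = 0.0
            for k in range(w):
                for ch in range(len(flt[k])):
                    acc += flt[k][ch] * x[start + k][ch]
            frame.append(acc)
        out.append(frame)
    return out


# Hypothetical sizes: t = 6 frames, f = 2 channels, n = 3 filters, w1 = 3.
x = [[1.0, 1.0] for _ in range(6)]
filters = [[[float(i + 1)] * 2 for _ in range(3)] for i in range(3)]

out = temporal_conv(x, filters, stride=1)
print(len(out), len(out[0]))  # 4 3
print(out[0])  # [6.0, 12.0, 18.0]
```

With a stride of 1 the filter visits every valid position, giving t - w1 + 1 = 4 output frames here; calling the same function with `stride=2` would visit only every second position.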

In addition, the processor 210 performs a second convolution operation with the result of the first convolution operation for each of different filters having the same channel length as the input feature map.

The number of filters used in the second convolution operation may be m, and the width of these filters may be defined as w2. The width of the filters used in the second convolution operation and the width of the filters used in the first convolution operation may be the same as or different from each other.
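The claims recite one way of composing the second convolution operation from three sub-convolutions: two with m filters of width w2 applied in sequence, and one with m filters of width 1 applied to the result of the first convolution, with the two branch outputs summed. The sketch below follows that structure with the stride values 2, 1, and 2 recited in the claims. It assumes zero padding so that the two branches have equal length, and it omits any activation or normalization because none is recited. The names `conv_same`, `make_filters`, and `residual_block` and all sizes and weights are hypothetical.

```python
def conv_same(x, filters, stride):
    """Temporal convolution with zero padding (assumed), so that the output
    has ceil(t / stride) frames regardless of the filter width."""
    w = len(filters[0])
    c = len(x[0])
    left = (w - 1) // 2
    right = w - 1 - left
    zero = [0.0] * c
    padded = [zero] * left + x + [zero] * right
    out = []
    for start in range(0, len(x), stride):
        out.append([
            sum(flt[k][ch] * padded[start + k][ch]
                for k in range(w) for ch in range(c))
            for flt in filters
        ])
    return out


def make_filters(count, width, channels, value):
    """Build `count` constant-valued filters (placeholder weights)."""
    return [[[value] * channels for _ in range(width)] for _ in range(count)]


def residual_block(x, main1, main2, shortcut):
    a = conv_same(x, main1, stride=2)     # first sub-convolution, width w2
    b = conv_same(a, main2, stride=1)     # second sub-convolution, width w2
    s = conv_same(x, shortcut, stride=2)  # third sub-convolution, width 1
    # Element-wise sum of the second and third sub-convolution results.
    return [[bv + sv for bv, sv in zip(bf, sf)] for bf, sf in zip(b, s)]


# Hypothetical sizes: 8 frames with n = 2 channels in, m = 3 filters, w2 = 3.
x = [[1.0, 1.0] for _ in range(8)]
main1 = make_filters(3, 3, 2, 0.5)
main2 = make_filters(3, 3, 3, 0.1)
shortcut = make_filters(3, 1, 2, 1.0)

y = residual_block(x, main1, main2, shortcut)
print(len(y), len(y[0]))  # 4 3
```

The width-1 branch with stride 2 brings the input to the same length and channel count as the main branch, which is what makes the final summation well defined.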

In addition, the processor 210 stores the result of the preceding convolution operation in the memory 220 as an output feature map, and extracts a voice keyword by applying the output feature map to a trained machine learning model.

The output feature map is the result finally obtained before the voice keyword is extracted, and may have the form shown in FIG. 3(b). The output feature map stored in the memory 220 can be understood as the result of the convolution operation performed in the immediately preceding step. Accordingly, in an embodiment of the present invention, the result of the second convolution operation may be stored as the output feature map.

Meanwhile, the machine learning model may include a pooling layer, a fully connected layer, a softmax operation, and the like. In addition, a pooling operation may be performed following the convolution operations described in this specification, and a zero padding technique may be applied to prevent the reduction in data size that may occur during the pooling operation.
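The final stage can be sketched as follows, assuming average pooling over the time axis followed by a single fully connected layer and a softmax. The pooling type, the layer sizes, the weights, and the keyword labels are all hypothetical and are chosen only to make the data flow concrete.

```python
import math


def average_pool(feature_map):
    """Average each channel over all time frames (assumed pooling type)."""
    t = len(feature_map)
    channels = len(feature_map[0])
    return [sum(frame[ch] for frame in feature_map) / t for ch in range(channels)]


def fully_connected(vec, weights, bias):
    """One output per weight row: dot(row, vec) + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output feature map: 2 frames x 2 channels.
output_map = [[1.0, 0.0], [3.0, 2.0]]
# Hypothetical classifier over three labels.
labels = ["yes", "no", "silence"]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]

pooled = average_pool(output_map)               # [2.0, 1.0]
logits = fully_connected(pooled, weights, bias)  # [2.0, 1.0, -3.0]
probs = softmax(logits)
keyword = labels[probs.index(max(probs))]
print(keyword)  # yes
```

The label with the highest softmax probability is taken as the extracted keyword.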

The embodiments described above may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and may include volatile and non-volatile media as well as removable and non-removable media.

In addition, the computer-readable medium may include a computer storage medium. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

20: keyword spotting device
210: processor
220: memory

Claims

A keyword spotting method using an artificial neural network, the method comprising:
obtaining an input feature map from an input voice;
performing a first convolution operation with the input feature map for each of n different filters having the same channel length as the input feature map and a width w1, w1 being smaller than the width of the input feature map;
performing a second convolution operation with a result of the first convolution operation for each of different filters having the same channel length as the input feature map;
storing a result of the preceding convolution operation as an output feature map; and
extracting a voice keyword by applying the output feature map to a trained machine learning model.

The method of claim 1,
wherein a stride value of the first convolution operation is 1.

The method of claim 1,
wherein performing the second convolution operation comprises:
performing a first sub-convolution operation between m different filters having a width of w2 and the result of the first convolution operation;
performing a second sub-convolution operation between m different filters having a width of w2 and a result of the first sub-convolution operation;
performing a third sub-convolution operation between m different filters having a width of 1 and the result of the first convolution operation; and
summing the results of the second and third sub-convolution operations.

The method of claim 3,
wherein stride values of the first to third sub-convolution operations are 2, 1, and 2, respectively.

The method of claim 3,
further comprising performing a third convolution operation, wherein performing the third convolution operation comprises:
performing a fourth sub-convolution operation between l different filters having a width of w2 and a result of the second convolution operation;
performing a fifth sub-convolution operation between l different filters having a width of w2 and a result of the fourth sub-convolution operation;
performing a sixth sub-convolution operation between l different filters having a width of 1 and the result of the second convolution operation; and
summing the results of the fifth and sixth sub-convolution operations.

The method of claim 3,
further comprising performing a fourth convolution operation, wherein performing the fourth convolution operation comprises:
performing a seventh sub-convolution operation between m different filters having a width of w2 and a result of the second convolution operation;
performing an eighth sub-convolution operation between m different filters having a width of w2 and a result of the seventh sub-convolution operation; and
summing the result of the second convolution operation and a result of the eighth sub-convolution operation.

The method of claim 6,
wherein the stride value of each of the seventh and eighth sub-convolution operations is 1.

The method of claim 1,
wherein obtaining the input feature map comprises obtaining an input feature map of size t × 1 × f (width × height × channel) from a result of MFCC (Mel Frequency Cepstral Coefficient) processing of the input voice,
where t is time and f is frequency.

A computer-readable recording medium on which a program for performing the method according to any one of claims 1 to 8 is recorded.

A keyword spotting device for extracting a voice keyword using an artificial neural network, the device comprising:
a memory in which at least one program is stored; and
a processor configured to extract a voice keyword using the artificial neural network by executing the at least one program,
wherein the processor is configured to:
obtain an input feature map from an input voice;
perform a first convolution operation with the input feature map for each of n different filters having the same channel length as the input feature map and a width w1, w1 being smaller than the width of the input feature map;
perform a second convolution operation with a result of the first convolution operation for each of different filters having the same channel length as the input feature map;
store a result of the preceding convolution operation as an output feature map; and
extract the voice keyword by applying the output feature map to a trained machine learning model.

The keyword spotting device of claim 10,
wherein a stride value of the first convolution operation is 1.

The keyword spotting device of claim 10,
wherein, in performing the second convolution operation, the processor is configured to:
perform a first sub-convolution operation between m different filters having a width of w2 and the result of the first convolution operation;
perform a second sub-convolution operation between m different filters having a width of w2 and a result of the first sub-convolution operation;
perform a third sub-convolution operation between m different filters having a width of 1 and the result of the first convolution operation; and
sum the results of the second and third sub-convolution operations.

The keyword spotting device of claim 12,
wherein stride values of the first to third sub-convolution operations are 2, 1, and 2, respectively.

The keyword spotting device of claim 12,
wherein the processor further performs a third convolution operation and, in performing the third convolution operation, is configured to:
perform a fourth sub-convolution operation between l different filters having a width of w3 and a result of the second convolution operation;
perform a fifth sub-convolution operation between l different filters having a width of w3 and a result of the fourth sub-convolution operation;
perform a sixth sub-convolution operation between l different filters having a width of 1 and the result of the second convolution operation; and
sum the results of the fifth and sixth sub-convolution operations.

The keyword spotting device of claim 12,
wherein the processor further performs a fourth convolution operation and, in performing the fourth convolution operation, is configured to:
perform a seventh sub-convolution operation between m different filters having a width of w2 and a result of the second convolution operation;
perform an eighth sub-convolution operation between m different filters having a width of w2 and a result of the seventh sub-convolution operation; and
sum the result of the second convolution operation and a result of the eighth sub-convolution operation.

The keyword spotting device of claim 15,
wherein the stride value of each of the seventh and eighth sub-convolution operations is 1.

The keyword spotting device of claim 10,
wherein the processor obtains the input feature map, of size t × 1 × f (width × height × channel), from a result of MFCC processing of the input voice,
where t is time and f is frequency.