KR20190135657A - Text-independent speaker recognition apparatus and method

Info

Publication number: KR20190135657A
Application number: KR1020180060925A
Authority: KR
Inventors: 이상훈; 이경준; 강지우
Original assignee: 연세대학교 산학협력단
Priority date: 2018-05-29
Filing date: 2018-05-29
Publication date: 2019-12-09
Also published as: KR102064077B1

Abstract

The present invention relates to an apparatus and a method for accurately recognizing a speaker even when voice data for an arbitrary phrase is applied. According to an embodiment of the present invention, the apparatus comprises: a voice/image acquisition unit that acquires voice data of a speaker and image data including all or part of the speaker's face; a feature extraction unit that is trained in advance according to a predesignated pattern recognition technique to extract a text-dependent image feature from the image data, extract a text-independent voice feature from the voice data, and obtain a feature value of the voice feature using the image feature as a weight; and a speaker determination unit that determines the speaker by comparing the feature value with a prestored speaker identification value.

Description

Text-independent speaker recognition apparatus and method {TEXT-INDEPENDENT SPEAKER RECOGNITION APPARATUS AND METHOD}

The present invention relates to a speaker recognition apparatus and method, and more particularly to a text-independent speaker recognition apparatus and method.

Speaker identification refers to a technology in which a user registers his or her voice in a module in advance and, when a voice is input again later, the module classifies the input voice to identify the speaker.

Conventionally, a speaker is recognized by comparing the input voice data for similarity with voice data previously stored in a database, so speaker recognition is possible only when voice data for a predetermined word or sentence is input. In addition, the recognition rate drops significantly depending on the surrounding environment and the state of the speaker.

Meanwhile, research on applying deep learning (or machine learning) techniques to improve the recognition rate has recently been active. Applying deep learning to speaker recognition can improve performance by reducing the loss of recognition rate caused by the surrounding environment and the speaker's state. However, deep learning requires training, and to ease that training and improve the recognition rate afterwards, these approaches still rely on designated words or sentences. Recognizing a speaker by making users say only the same phrase is inconvenient for users, and is therefore an unreasonable method that is not consumer-friendly.

Korean Patent Publication No. 10-2004-0067573 (published July 30, 2004)

An object of the present invention is to provide a text-independent speaker recognition apparatus and method capable of accurately recognizing a speaker even when voice data for an arbitrary phrase is applied.

Another object of the present invention is to provide a text-independent speaker recognition apparatus and method capable of accurately recognizing a speaker by using image data of the speaker together with voice data.

To achieve the above object, a speaker recognition apparatus according to an embodiment of the present invention includes: a voice/image acquisition unit that acquires voice data of a speaker and image data including all or part of the speaker's face; a feature extraction unit that is trained in advance according to a predesignated pattern recognition technique, extracts a text-dependent image feature from the image data, extracts a text-independent voice feature from the voice data, and obtains a feature value of the voice feature using the image feature as a weight; and a speaker determination unit that determines the speaker by comparing the feature value with a prestored speaker identification value.

In one embodiment, the feature extraction unit may include: an image feature extraction unit that is trained in advance according to a predesignated pattern recognition technique and extracts, from the image data, the image feature for one of the speaker's text-dependent facial expression or mouth shape; a voice feature extraction unit that is trained in advance according to a predesignated pattern recognition technique and extracts, from the voice data, the voice feature for at least one of the speaker's text-independent pitch, timbre, and tone; and an integrated feature extraction unit that applies the image feature to the voice feature as a weight and, being trained in advance according to a predesignated pattern recognition technique, extracts the feature value from the voice feature.

In another embodiment, the feature extraction unit may include: an image feature extraction unit that is trained in advance according to a predesignated pattern recognition technique and extracts, from the image data, the image feature for one of the speaker's text-dependent facial expression or mouth shape; and a voice feature extraction unit that is trained in advance according to a predesignated pattern recognition technique, applies the image feature as a weight, and extracts from the voice data, as the feature value, the voice feature for at least one of the speaker's text-independent pitch, timbre, and tone.

The image feature extraction unit may be implemented as a pre-trained 2D CNN (2D convolutional neural network) when the image data is single-frame image data, and as a pre-trained 3D CNN (3D convolutional neural network) when the image data is multi-frame image data. The voice feature extraction unit and the integrated feature extraction unit may each be implemented as a pre-trained 2D CNN.

The speaker recognition apparatus may further include a voice/image conversion unit that, when video data is applied from the voice/image acquisition unit, separates the video data into the voice data and the image data.

To achieve the above object, a speaker recognition method according to another embodiment of the present invention includes: acquiring voice data of a speaker and image data including all or part of the speaker's face; extracting, by means trained in advance according to a predesignated pattern recognition technique, a text-dependent image feature from the image data and a text-independent voice feature from the voice data, and obtaining a feature value of the voice feature using the image feature as a weight; and determining the speaker by comparing the feature value with a prestored speaker identification value.

Accordingly, the speaker recognition apparatus and method of the present invention acquire image data of the speaker together with the speaker's voice data, extract from the image data features that are common across many speakers and depend on the phrase, such as facial expression and mouth shape, and recognize the speaker by combining the extracted common features with the speaker-specific features extracted from the voice data. The speaker can thus be recognized from voice data for an arbitrary phrase, regardless of the phrase. Because no word or sentence needs to be designated for speaker recognition, user convenience is greatly increased while speaker recognition performance is also improved.

FIG. 1 shows a schematic configuration of a speaker recognition apparatus according to an embodiment of the present invention.
FIG. 2 shows a schematic configuration of a speaker recognition apparatus according to another embodiment of the present invention.
FIG. 3 is a view for explaining a speaker recognition method according to an embodiment of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the contents described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described. To describe the present invention clearly, parts irrelevant to the description are omitted, and like reference numerals in the drawings denote like members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "...er", "module", and "block" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 shows a schematic configuration of a speaker recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the speaker recognition apparatus according to the present embodiment includes: an input data acquisition unit (100) that acquires input data from which a speaker is to be recognized and converts the acquired input data; a feature extraction unit (200) that extracts, from the input data acquired and converted by the input data acquisition unit (100), the voice and image features and a feature in which the extracted voice and image features are combined; and a speaker determination unit (300) that determines and recognizes the speaker based on the features extracted by the feature extraction unit (200).

First, the input data acquisition unit (100) acquires the input data to be recognized and converts it so that the feature extraction unit (200) can extract features from it. The input data acquisition unit (100) includes a voice/image acquisition unit (110) and a voice/image conversion unit (120).

The voice/image acquisition unit (110) acquires, as the input data to be recognized, image data together with the speaker's voice data. Whereas a typical speaker recognition apparatus acquires only voice data, the present embodiment also acquires image data of the speaker in order to improve recognition accuracy. The image data acquired here corresponds to the voice data: it is image data captured while the speaker utters the voice data, and may be single-frame or multi-frame image data.

In particular, when the image data is multi-frame image data, the input data may be video information in which the voice data and the multi-frame image data are combined. The image data may be an image of the speaker's entire face, or an image capturing only part of the face, in particular the mouth shape.

The voice/image conversion unit (120) converts the voice data and image data acquired by the voice/image acquisition unit (110) into the form required by the feature extraction unit (200). First, when the input data is video data, the voice/image conversion unit (120) may separate the acquired input data into voice data and image data.

The separated image data is then converted into a form from which the image feature extraction unit (210) of the feature extraction unit (200) can easily extract features, and the voice data into a form from which the voice feature extraction unit (220) can easily extract features.

For example, when the acquired voice data is voice waveform data, the voice/image conversion unit (120) may convert it into voice data in the form of a frequency spectrum graph, for which it may perform a fast Fourier transform (FFT). The voice/image conversion unit (120) may also convert the resolution and size of the image data. In addition, when the image feature extraction unit (210) is configured to extract features from single-frame image data and the image data is input as multi-frame image data, the voice/image conversion unit (120) may be configured to select one of the frames and transmit it.
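The waveform-to-spectrum conversion described above can be sketched as follows. This is only an illustration under stated assumptions: the specification names the FFT but gives no frame length, hop size, or window, so `frame_len`, `hop`, the Hann window, and the function names are hypothetical, and a naive DFT stands in for the FFT (it yields the same values, only more slowly).

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude spectrum of one frame via a naive DFT (an FFT gives the same values faster)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)  # keep non-negative frequencies only
    ]

def spectrogram(waveform, frame_len=64, hop=32):
    """Split the waveform into overlapping Hann-windowed frames and transform each one."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        chunk = waveform[start:start + frame_len]
        frames.append(dft_magnitudes([w * x for w, x in zip(window, chunk)]))
    return frames  # indexed as [time][frequency]

# Hypothetical input: a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
spec = spectrogram(tone)

# Bin spacing is sample_rate / frame_len = 125 Hz, so the tone lands in bin 8.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

The resulting time-by-frequency grid is the kind of two-dimensional input that a 2D CNN such as the voice feature extraction unit (220) could consume.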

However, when the voice/image acquisition unit (110) is configured to acquire voice data and image data already in the form required by the feature extraction unit (200), the voice/image conversion unit (120) may be omitted.

The feature extraction unit (200) includes an image feature extraction unit (210) that receives the image data and extracts an image feature, a voice feature extraction unit (220) that receives the voice data and extracts a voice feature, and an integrated feature extraction unit (230) that extracts a feature in which the extracted voice feature and image feature are integrated.

The image feature extraction unit (210) receives the image data acquired by the input data acquisition unit (100) and extracts an image feature by analyzing it according to a predesignated pattern recognition technique. In particular, in the present embodiment the image feature extraction unit (210) extracts text-dependent features, such as the speaker's facial expression and mouth shape, as the image feature.

In general, features that are common across different individuals, such as mouth shape and phrase information, or features that vary within the same individual, act as a kind of noise and can reduce the speaker recognition rate. Existing speaker recognition apparatuses therefore excluded text-dependent features as far as possible to minimize their influence. In the present embodiment, however, the text-dependent information extracted by the image feature extraction unit (210) is provided as a guideline for distinguishing voice features, so that the speaker's voice features can be used to the fullest.

When a single-frame image is applied from the input data acquisition unit (100), the image feature extraction unit (210) may be implemented, for example, as a pre-trained two-dimensional convolutional neural network (2D CNN). A 2D CNN is an artificial neural network mainly used to extract features from two-dimensional images.

However, when a multi-frame image is applied from the input data acquisition unit (100), the image feature extraction unit (210) may be configured to treat the image data, which contains multiple two-dimensional frames, as three-dimensional image data and extract a pattern from it. For example, the image feature extraction unit (210) may be implemented as a pre-trained three-dimensional convolutional neural network (3D CNN).
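The difference between the single-frame and multi-frame paths can be illustrated with one convolution layer of each kind. This is a minimal sketch, not a trained network: the specification does not give layer counts, kernel sizes, or weights, so the kernels and the function names `conv2d`, `conv3d`, and `image_feature_map` are hypothetical.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation of image [H][W] with kernel [kh][kw]."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def conv3d(frames, kernel):
    """Valid 3D cross-correlation of frames [T][H][W] with kernel [kt][kh][kw]."""
    kt = len(kernel)
    out = []
    for t in range(len(frames) - kt + 1):
        # Each output frame sums 2D responses over kt consecutive input frames.
        partial = [conv2d(frames[t + d], kernel[d]) for d in range(kt)]
        out.append([
            [sum(p[i][j] for p in partial) for j in range(len(partial[0][0]))]
            for i in range(len(partial[0]))
        ])
    return out

def image_feature_map(image_data, kernel_2d, kernel_3d):
    """Dispatch on input shape: one frame takes the 2D path, several frames the 3D path."""
    is_single_frame = not isinstance(image_data[0][0], list)
    return conv2d(image_data, kernel_2d) if is_single_frame else conv3d(image_data, kernel_3d)

# Hypothetical data: one 3x3 frame, and a two-frame clip whose second frame is brighter by 1.
frame = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
clip = [frame, [[v + 1 for v in row] for row in frame]]
diag_kernel = [[1, 0], [0, -1]]      # 2D: responds to diagonal intensity change
motion_kernel = [[[-1]], [[1]]]      # 3D: responds to change between consecutive frames

single = image_feature_map(frame, diag_kernel, motion_kernel)
multi = image_feature_map(clip, diag_kernel, motion_kernel)
```

The 3D kernel spans the time axis as well, which is what lets it respond to mouth movement between frames rather than to a static mouth shape.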

The image feature extraction unit (210) may be trained in advance to extract various text-dependent features; here it is assumed, as an example, to be configured to extract features of the speaker's facial expression or mouth shape.

Meanwhile, the voice feature extraction unit (220) receives the voice data acquired by the input data acquisition unit (100) and extracts a voice feature by analyzing it according to a predesignated pattern recognition technique. In particular, in the present embodiment the voice feature extraction unit (220) extracts, as the voice feature, personal characteristics of the speaker that are independent of the phrase, such as pitch, timbre, and tone.

The integrated feature extraction unit (230) extracts an integrated feature by re-extracting the voice feature extracted by the voice feature extraction unit (220) based on the image feature extracted by the image feature extraction unit (210). The integrated feature extraction unit (230) may add different weights to the extracted voice feature according to the image feature (here, one of the speaker's facial expression or mouth shape features). The image feature thus functions as a per-region weight on the voice feature, making it possible to designate the more important regions within the extracted voice feature.

The integrated feature extraction unit (230) then extracts the integrated feature of the voice feature according to the weights provided by the image feature and outputs a feature value. The feature value output by the integrated feature extraction unit (230) may be obtained as a real value.
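One way to picture the image feature acting as a per-region weight on the voice feature is the sketch below. It is an assumption-laden illustration: the specification implements this step with a pre-trained 2D CNN and does not define the weighting formula, so the softmax normalization, the weighted pooling, and the names `softmax` and `fuse` are stand-ins chosen for clarity.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(voice_regions, image_scores):
    """Weight each voice-feature region by an image-derived score and pool the regions."""
    weights = softmax(image_scores)
    dim = len(voice_regions[0])
    return [sum(w * region[d] for w, region in zip(weights, voice_regions)) for d in range(dim)]

# Hypothetical voice feature with two regions, each a 2-dimensional vector.
voice_regions = [[1.0, 0.0], [0.0, 1.0]]

even = fuse(voice_regions, [0.0, 0.0])       # image feature has no preference
focused = fuse(voice_regions, [10.0, -10.0])  # image feature marks region 0 as important
```

With equal scores the result is the plain average of the regions; with a strong preference the result is dominated by the region the image feature marks as important.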

Like the voice feature extraction unit (220), the integrated feature extraction unit (230) may be implemented, for example, as a pre-trained 2D CNN. That is, in the present embodiment, the image feature extraction unit (210), the voice feature extraction unit (220), and the integrated feature extraction unit (230) are each an artificial neural network trained in advance according to a designated deep learning algorithm.

Meanwhile, the speaker determination unit (300) stores in advance speaker identification values, which are the feature values of a plurality of speakers, and compares the feature value applied from the feature extraction unit (200) with the stored speaker identification values to determine the speaker corresponding to the most similar one. In other words, it recognizes the speaker.
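The comparison against stored speaker identification values can be sketched as a nearest-match lookup. The specification only says that the most similar identification value is selected, so the use of cosine similarity, the vector form of the feature value, and the names `cosine_similarity`, `identify_speaker`, and `enrolled` are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def identify_speaker(feature_value, enrolled):
    """Return (speaker_id, score) for the enrolled identification value most similar to the query."""
    return max(
        ((speaker, cosine_similarity(feature_value, ref)) for speaker, ref in enrolled.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical enrolled speaker identification values and a query feature value.
enrolled = {"speaker_a": [0.9, 0.1, 0.0], "speaker_b": [0.1, 0.8, 0.3]}
best, score = identify_speaker([0.2, 0.7, 0.4], enrolled)
```

A deployed system would typically also apply a minimum-score threshold so that an unenrolled speaker is rejected rather than mapped to the closest enrolled one; the specification does not discuss this.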

FIG. 2 shows a schematic configuration of a speaker recognition apparatus according to another embodiment of the present invention.

Like the speaker recognition apparatus of FIG. 1, the speaker recognition apparatus of FIG. 2 includes an input data acquisition unit (100), a feature extraction unit (400), and a speaker determination unit (300). Because the input data acquisition unit (100) and the speaker determination unit (300) in FIG. 2 have the same configuration as in FIG. 1, they are not described in detail here.

In the speaker recognition apparatus of FIG. 2, however, the feature extraction unit (400) includes only an image feature extraction unit (410) and a voice feature extraction unit (420), unlike the feature extraction unit (200) of FIG. 1. That is, it does not include the integrated feature extraction unit (230).

Like the image feature extraction unit (210) of FIG. 1, the image feature extraction unit (410) receives the image data acquired by the input data acquisition unit (100) and extracts an image feature by analyzing it according to a predesignated pattern recognition technique, extracting text-dependent features in particular as the image feature. In the speaker recognition apparatus of FIG. 2, however, the image feature extraction unit (410) passes the extracted image feature to the voice feature extraction unit (420).

Meanwhile, the voice feature extraction unit (420) receives the image feature passed from the image feature extraction unit (410) together with the voice data acquired by the input data acquisition unit (100), and extracts a voice feature by analyzing them according to a predesignated pattern recognition technique.

In the speaker recognition apparatus of FIG. 1 as well, the image feature extracted by the image feature extraction unit (210) is used to designate the important regions of the voice feature. In the apparatus of FIG. 2, the image feature extracted by the image feature extraction unit (410) is therefore passed directly to the voice feature extraction unit (420), which extracts the voice feature while adding different weights to each region of the voice data according to that image feature, so that the integrated feature extraction unit (230) can be omitted.

In particular, when the input data is video data containing both voice data and image data, the image feature can be used to easily designate important regions (for example, time intervals) in the voice data. That is, the image feature can assign a high weight to the corresponding time interval of the voice data.
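The time-interval weighting can be sketched as follows, assuming the image feature has been reduced to one activity score per video frame and the voice data to a sequence of spectrum frames. The specification does not say how the two time axes are aligned or how weights are normalized, so the nearest-frame stretching, the sum-to-one normalization, and the names `time_weights` and `weight_spectrogram` are hypothetical.

```python
def time_weights(mouth_activity, num_audio_frames):
    """Stretch per-video-frame scores onto the audio time axis and normalize them to sum to 1."""
    n_video = len(mouth_activity)
    stretched = [
        mouth_activity[min(i * n_video // num_audio_frames, n_video - 1)]
        for i in range(num_audio_frames)
    ]
    total = sum(stretched)
    if total == 0:
        # No visible activity anywhere: fall back to uniform weights.
        return [1.0 / num_audio_frames] * num_audio_frames
    return [s / total for s in stretched]

def weight_spectrogram(spec, weights):
    """Scale every audio frame (row of spec) by the weight of its time interval."""
    return [[w * v for v in row] for w, row in zip(weights, spec)]

# Hypothetical clip: the mouth is still in the first video frame and moving in the second.
weights = time_weights([0.0, 1.0], num_audio_frames=4)
weighted = weight_spectrogram([[1.0, 2.0]] * 4, weights)
```

Here the audio frames that coincide with the active video frame keep all of the weight, while the frames recorded during the still interval are suppressed.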

FIG. 3 is a view for explaining a speaker recognition method according to an embodiment of the present invention.

Describing the speaker recognition method of FIG. 3 with reference to FIG. 1, input data from which a speaker is to be recognized is first acquired (S10). The input data acquired here contains image data together with the speaker's voice data. The image data may be single-frame or multi-frame image data, and the input data may also be video data in which the voice data and image data are combined.

If the input data is video data, the input data acquisition unit (100) may separate the image data and the voice data from the video data. It may then convert each of them into a form from which the feature extraction unit (200) can easily extract features. For example, the input data acquisition unit (100) may convert the voice data into data in the form of a frequency spectrum graph.

The feature extraction unit (200) receives the image data from among the input data acquired by the input data acquisition unit (100) and extracts features of the image data according to a pre-trained pattern recognition technique (S20). The pre-trained pattern recognition technique may be, for example, a 2D CNN or a 3D CNN. If the input image data is a single-frame image, its pattern can be recognized by a network trained as a 2D CNN; if the input image data is a multi-frame image, its pattern can be recognized by a network trained as a 3D CNN.

Here, the feature extraction unit (200) may extract text-dependent features, such as the speaker's facial expression or mouth shape, from the image data.

Meanwhile, the feature extraction unit (200) receives the voice data from among the input data acquired by the input data acquisition unit (100) and extracts features of the voice data according to a pre-trained pattern recognition technique (S30). In contrast to the image feature, the feature extraction unit (200) extracts from the voice data personal characteristics of the speaker that are independent of the phrase, such as pitch, timbre, and tone.

The feature extraction unit (200) then combines the extracted image feature and voice feature to extract an integrated feature (S40). Here, the feature extraction unit (200) obtains the feature value by re-extracting the voice feature after adding per-region weights to it based on the image feature.

Although FIG. 3 shows the step of extracting the image feature (S20) and the step of extracting the voice feature (S30) being performed in parallel, with the step of extracting the integrated feature (S40) performed separately, the method may also be configured, as in the speaker recognition apparatus of FIG. 2, so that the step of extracting the image feature (S20) is performed first and the step of extracting the voice feature (S30) is performed afterwards.

In this case, the step of extracting the voice feature (S30) may be configured to extract the feature by applying weights to the voice data based on the image feature obtained in the step of extracting the image feature (S20). The step of extracting the integrated feature (S40) may then be omitted.

Once the feature extraction unit 200 has obtained the feature value, the speaker determination unit 300 determines the speaker by comparing the obtained feature value with the speaker identification values, which are feature values stored in advance for each speaker (S50).
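
The comparison in step S50 could look like the following sketch. The source does not state which similarity measure or decision rule is used, so the cosine similarity, the `threshold` parameter, and the names `identify_speaker` and `enrolled` are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(feature_value, enrolled, threshold=0.8):
    """Return (speaker_id, score) for the best match, or (None, score) if too dissimilar.

    enrolled:  dict mapping speaker id -> prestored speaker identification value.
    threshold: assumed minimum similarity for accepting a match.
    """
    best_id, best_score = None, -1.0
    for speaker_id, reference in enrolled.items():
        score = cosine_similarity(feature_value, reference)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


enrolled = {"speaker_a": [1.0, 0.0], "speaker_b": [0.0, 1.0]}
match, match_score = identify_speaker([0.9, 0.1], enrolled)
unknown, unknown_score = identify_speaker([1.0, 1.0], enrolled)
```

The threshold lets the sketch reject inputs that resemble no enrolled speaker; whether the described apparatus rejects unknown speakers at all is not stated in the source.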

As a result, the speaker recognition apparatus and method according to this embodiment extract text-dependent features of the speaker from the image data and text-independent features from the voice data, and then use the text-dependent features as a basis for recognizing the speaker from the voice data regardless of the text. The speaker can therefore be recognized accurately even when uttering an arbitrary phrase rather than a designated word or sentence.

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible from them.

Accordingly, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

100: input data acquisition unit
110: voice/image acquisition unit
120: voice/image conversion unit
200, 400: feature extraction unit
210, 410: image feature extraction unit
220, 420: voice feature extraction unit
230: integrated feature extraction unit
300: speaker determination unit

Claims

A speaker recognition apparatus comprising:
a voice/image acquisition unit that obtains voice data of a speaker and image data including all or part of the speaker's face;
a feature extraction unit that is trained in advance according to a predetermined pattern recognition technique, extracts a text-dependent image feature from the image data, extracts a text-independent voice feature from the voice data, and obtains a feature value of the voice feature weighted by the image feature; and
a speaker determination unit that determines the speaker by comparing the feature value with a prestored speaker identification value.

The speaker recognition apparatus of claim 1, wherein the feature extraction unit comprises:
an image feature extraction unit that is trained in advance according to a predetermined pattern recognition technique and extracts, from the image data, the image feature for one of the speaker's facial expression or mouth shape, which is text-dependent;
a voice feature extraction unit that is trained in advance according to a predetermined pattern recognition technique and extracts, from the voice data, the voice feature for at least one of the speaker's pitch, timbre, and intonation, which is text-independent; and
an integrated feature extraction unit that is trained in advance according to a predetermined pattern recognition technique, applies the image feature to the voice feature as a weight, and extracts the feature value from the voice feature.

The speaker recognition apparatus of claim 1, wherein the feature extraction unit comprises:
an image feature extraction unit that is trained in advance according to a predetermined pattern recognition technique and extracts, from the image data, the image feature for one of the speaker's facial expression or mouth shape, which is text-dependent; and
a voice feature extraction unit that is trained in advance according to a predetermined pattern recognition technique and, with the image feature applied as a weight, extracts from the voice data the voice feature for at least one of the speaker's pitch, timbre, and intonation, which is text-independent.

The speaker recognition apparatus of claim 2, wherein the image feature extraction unit is implemented as a pre-trained 2D convolutional neural network (2D CNN) when the image data is single-frame image data, and as a pre-trained 3D convolutional neural network (3D CNN) when the image data is multi-frame image data, and
wherein the voice feature extraction unit is implemented as a pre-trained 2D CNN.

The speaker recognition apparatus of claim 2, wherein the integrated feature extraction unit is implemented as a pre-trained 2D CNN.

The speaker recognition apparatus of claim 1, further comprising:
a voice/image conversion unit that, when video data is received from the voice/image acquisition unit, separates the video data into the voice data and the image data.

A speaker recognition method comprising:
obtaining voice data of a speaker and image data including all or part of the speaker's face;
obtaining a feature value by, having been trained in advance according to a predetermined pattern recognition technique, extracting a text-dependent image feature from the image data, extracting a text-independent voice feature from the voice data, and weighting the voice feature by the image feature; and
determining the speaker by comparing the feature value with a prestored speaker identification value.

The speaker recognition method of claim 7, wherein obtaining the feature value comprises:
extracting, having been trained in advance according to a predetermined pattern recognition technique, the image feature for one of the speaker's facial expression or mouth shape, which is text-dependent;
extracting from the voice data, having been trained in advance according to a predetermined pattern recognition technique, the voice feature for at least one of the speaker's pitch, timbre, and intonation, which is text-independent; and
applying the image feature to the voice feature as a weight and extracting the feature value from the voice feature, having been trained in advance according to a predetermined pattern recognition technique.

The speaker recognition method of claim 7, wherein obtaining the feature value comprises:
extracting, having been trained in advance according to a predetermined pattern recognition technique, the image feature for one of the speaker's facial expression or mouth shape, which is text-dependent; and
extracting from the voice data, having been trained in advance according to a predetermined pattern recognition technique and with the image feature applied as a weight, the voice feature for at least one of the speaker's pitch, timbre, and intonation, which is text-independent.

The speaker recognition method of claim 8 or 9, wherein extracting the image feature comprises extracting the image feature using a pre-trained 2D convolutional neural network (2D CNN) when the image data is single-frame image data, and using a 3D convolutional neural network (3D CNN) when the image data is multi-frame image data, and
wherein extracting the feature value comprises extracting the feature value using a pre-trained 2D CNN.

The speaker recognition method of claim 8, wherein extracting the voice feature comprises extracting the voice feature using a pre-trained 2D CNN.

The speaker recognition method of claim 7, further comprising:
separating the video data into the voice data and the image data when video data is acquired in the step of obtaining the image data.