KR101809511B1

KR101809511B1 - Apparatus and method for age group recognition of speaker

Info

Publication number: KR101809511B1
Application number: KR1020160099591A
Authority: KR
Inventors: 권순일; 백성욱; 전석봉; 손귀영
Original assignee: 세종대학교산학협력단
Priority date: 2016-08-04
Filing date: 2016-08-04
Publication date: 2017-12-15

Abstract

The present invention provides an apparatus that recognizes the age group of a speaker based on speech patterns, including the speaker's hesitation or delayed conversational response, and an age group recognition method using the same. The apparatus comprises a memory storing a program that recognizes the age group of the speaker, and a processor that executes the program stored in the memory. By executing the program, the processor extracts speech-pattern feature elements of the speaker from the speaker's speech signal and determines the age group of the speaker based on the extracted speech-pattern feature elements and an age group classification model. The age group classification model is generated from a plurality of speech signals collected for each age group. The speech-pattern feature elements include at least one of the speaker's hesitation and delayed conversational response.

Description

APPARATUS AND METHOD FOR AGE GROUP RECOGNITION OF SPEAKER

The present invention relates to an apparatus and method for recognizing the age group of a speaker.

Conventional speech-signal-based discrimination techniques, which determine demographic information such as a speaker's gender and age group from a speech signal, use features such as frequency or acoustic information derived from frequency. Once the features are extracted, such a technique determines the speaker's demographic information based on differences in the extracted features.

To extract features from the speech signal, these techniques use the linear predictive coefficient method, the cepstrum method, the Mel frequency cepstral coefficient (MFCC) method, the filter bank energy method, and the like. To determine gender from the extracted features, they may use machine learning algorithms such as a Gaussian mixture model, a neural network model, a support vector machine, and a hidden Markov model.

Because these techniques rely on acoustic characteristics of the speech signal that vary with demographic information, they have difficulty determining the gender of a speaker whose voice does not show a clear frequency difference. Conventional gender discrimination techniques, which determine gender from acoustic information alone, therefore need to be supplemented.

In this regard, Korean Patent Registration No. 10-1189765 (title: Method and apparatus for gender-age discrimination based on voice and image) discloses a method and apparatus that can identify the gender and age of a specific person from input image information and voice information. Specifically, that invention can compute gender and age accurately by combining voice recognition and face recognition while taking into account the correlation between gender information and age information.

The present invention is intended to solve the problems of the prior art described above. It provides an apparatus that recognizes the age group of a speaker based on speech patterns, including the speaker's hesitation or delayed conversational response, and an age group recognition method using the same.

However, the technical problem to be solved by the present embodiment is not limited to the one described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, an apparatus for recognizing the age group of a speaker according to a first aspect of the present invention includes a memory storing a program that recognizes the age group of the speaker, and a processor that executes the program stored in the memory. By executing the program, the processor extracts speech-pattern feature elements of the speaker from the speaker's speech signal and determines the age group of the speaker based on the extracted speech-pattern feature elements and an age group classification model. The age group classification model is generated from a plurality of speech signals collected for each age group, and the speech-pattern feature elements include at least one of the speaker's hesitation and conversational response delay.

In addition, a method by which a recognition apparatus recognizes the age group of a speaker according to a second aspect of the present invention includes: extracting speech-pattern feature elements of the speaker from the speaker's speech signal; and determining the age group of the speaker based on the extracted speech-pattern feature elements and an age group classification model. The age group classification model is generated from a plurality of speech signals collected for each age group, and the speech-pattern feature elements include at least one of the speaker's hesitation and conversational response delay.

The present invention can recognize a user's age group using an age group classification model generated from speech-pattern feature elements, such as hesitation and conversational response delay, extracted from the speaker's speech signal. The present invention can therefore determine the speaker's age group from age-group-specific phonetic feature elements without any previously stored personal information.

In addition, because the present invention can automatically extract and recognize demographic information from a speaker's voice, it can be used for personalized services on various electronic devices.

FIG. 1 is a block diagram of an age group recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram of an age group recognition program according to an embodiment of the present invention.
FIG. 3 is a flowchart of an age group recognition method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described here. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are denoted by similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. Also, when a part is said to "include" a component, this means that it may further include other components, not that it excludes them, unless specifically stated otherwise.

Next, an apparatus 100 for recognizing the age group of a speaker according to an embodiment of the present invention is described with reference to FIGS. 1 and 2.

FIG. 1 is a block diagram of the age group recognition apparatus 100 according to an embodiment of the present invention.

The age group recognition apparatus 100 extracts speech-pattern feature elements from a speaker's speech signal and recognizes the speaker's age group. The age group recognition apparatus 100 includes a memory 110 and a processor 120. It may also include an input module 130.

An age group is one of a plurality of intervals into which speakers' ages are divided. For example, age groups may divide the speaker's age into predetermined interval units, such as teens, twenties, and thirties. Age groups may also be defined according to predetermined criteria, such as children, adolescents, young adults, and the elderly, but are not limited thereto.

The input module 130 may receive the speaker's speech signal. The input module 130 may pass the speaker's analog input signal, received through an audio device such as a microphone, to the processor 120, or it may convert the analog input signal to digital form and pass it to the processor 120. For example, the input module 130 may be a sound card or a sound chipset mounted in the age group recognition apparatus 100, but is not limited thereto.

Alternatively, the input module 130 may receive a speaker's speech signal collected by a separate mobile device, computing device, or the like. For example, the input module 130 may be a communication module that performs data communication with other devices.

The memory 110 stores a program that recognizes the age group of the speaker. Here, the memory 110 collectively refers to non-volatile storage devices, which retain stored information even when power is not supplied, and volatile storage devices, which require power to retain stored information.

By executing the program stored in the memory 110, the processor 120 extracts feature elements from the speaker's speech signal.

As described above, the speech signal may be one received through the input module 130. The speech signal may also be one stored in a storage module (not shown) included in the age group recognition apparatus 100, or one collected by another device and then delivered to the age group recognition apparatus 100 through the input module 130, but is not limited thereto.

The speech signal may also be received from a plurality of speakers. For example, the speech signal may be a telephone call signal, in which two or more speakers converse by telephone, collected through an input module 130 connected to the telephone. The speech signal may be one collected by a recorder included in the telephone and delivered through the input module 130, but is not limited thereto.

The speech-pattern feature elements may be at least one of a hesitation feature element and a conversational response delay feature element of the speaker.

Hesitation is based on the phenomenon of a speaker pausing mid-utterance during a conversation. The hesitation feature element can therefore be extracted based on the duration or the number of pauses the speaker makes while speaking.

Conversational response delay is extracted based on the phenomenon of a delay occurring, after another speaker has finished talking, before the speaker in question begins to speak. It can therefore be calculated from the difference between the time at which the other speaker's utterance ends and the time at which the speaker's utterance begins. Alternatively, it can be calculated from the time difference between another speaker's question and the start of the speaker's answer to that question.

For example, hesitation or conversational response delay is more likely to occur in children or the elderly. The processor 120 can therefore determine the speaker's age group based on the duration, probability, or number of occurrences of the speech-pattern feature elements contained in the speaker's speech signal.

Specifically, the processor 120 may identify a plurality of speakers in the received speech signal. The processor 120 may then extract feature elements from the speech signal of the speaker, among the identified speakers, whose age group is to be determined.

The processor 120 may determine the speaker's age group based on the extracted speech-pattern feature elements and the age group classification model. The age group determination process is described in detail below with reference to FIG. 2.

FIG. 2 is a block diagram of an age group recognition program according to an embodiment of the present invention.

The processor 120 may receive, through the input module 130, a speech signal 270 containing a conversation between a first speaker and a second speaker. Through the preprocessing module 210, the processor 120 may separate the speech signal of the first speaker and the speech signal of the second speaker from the received speech signal.

The processor 120 may also preprocess the speech signal 270 of the first and second speakers through the preprocessing module 210. The preprocessing may consist of removing noise from the speech signal. Alternatively, the preprocessing may consist of extracting, from the input speech signal, a speech signal of the unit length used for age group determination, but is not limited thereto.

Through the section division module 220, the processor 120 may then divide the preprocessed speech signal of the first speaker into a plurality of sections according to a predetermined length or a predetermined number. For example, the predetermined length may be 100 ms, in which case the processor 120 divides the speech signal into 100 ms units. Alternatively, the predetermined number may be 16, in which case the processor 120 divides the speech signal into 16 sections. The predetermined length and the predetermined number are not limited to these values.
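The two ways of dividing the signal described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `split_by_length` and `split_by_count`, the representation of the signal as a list of samples, and the handling of a shorter final section are all assumptions.

```python
from typing import List, Sequence


def split_by_length(samples: Sequence[float], sample_rate: int,
                    length_ms: int = 100) -> List[Sequence[float]]:
    """Split a signal into consecutive sections of length_ms milliseconds.

    The last section may be shorter when the signal length is not a
    multiple of the section length (an assumption; the source does not
    say how a remainder is handled).
    """
    step = sample_rate * length_ms // 1000  # samples per section
    if step <= 0:
        raise ValueError("section length must cover at least one sample")
    return [samples[i:i + step] for i in range(0, len(samples), step)]


def split_by_count(samples: Sequence[float],
                   count: int = 16) -> List[Sequence[float]]:
    """Split a signal into `count` sections of near-equal length."""
    if count <= 0:
        raise ValueError("count must be positive")
    n = len(samples)
    # Integer boundaries spread the remainder evenly across sections.
    bounds = [n * k // count for k in range(count + 1)]
    return [samples[bounds[k]:bounds[k + 1]] for k in range(count)]
```

With an 8 kHz signal, `split_by_length(signal, 8000)` yields sections of 800 samples each, while `split_by_count(signal)` always yields 16 sections regardless of the signal's duration.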

The processor 120 extracts speech-pattern feature elements from each of the divided sections. The speech-pattern feature elements may be at least one of the hesitation extracted through the hesitation time extraction module 230 and the conversational response delay extracted through the conversational response time extraction module 240.

As described above, the speaker's hesitation feature element is extracted based on the phenomenon of the speaker pausing mid-utterance. That is, it can be calculated from the duration of the pauses in the speaker's utterance, the number of such pauses, and the like.

For example, through the hesitation time extraction module 230, the processor 120 may determine whether each of the sections contains a pause in speech. The processor 120 may determine that a section contains a pause when no speech signal is input in that section for at least a certain time, or when only noise below an allowable decibel level is input for at least a certain time.

The processor 120 may then calculate the frequency of pause occurrences for each section that contains a pause. The processor 120 may extract the pause frequency of each section as the speaker's hesitation feature element.

Alternatively, the processor 120 may calculate, for each of the sections, the length of the silence caused by pauses in speech. The processor 120 may extract the calculated silence length as the speaker's hesitation feature element.
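One way to detect pauses inside a section, and to derive both the pause count and the silence length from them, is sketched below. It is an illustration under stated assumptions: samples are floats in [-1, 1], the level is measured as RMS in dB relative to full scale over short frames, and the names `frame_level_db` and `find_pauses` and the default values for `frame_ms`, `threshold_db`, and `min_pause_ms` are hypothetical. The source only requires "a certain time" and "an allowable decibel level" without giving values.

```python
import math
from typing import List, Sequence


def frame_level_db(frame: Sequence[float]) -> float:
    """RMS level of a frame in dB relative to full scale."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def find_pauses(samples: Sequence[float], sample_rate: int,
                frame_ms: int = 10, threshold_db: float = -40.0,
                min_pause_ms: int = 50) -> List[int]:
    """Return the duration in ms of every pause found in `samples`.

    A pause is a run of consecutive frames whose level stays below
    threshold_db for at least min_pause_ms. A trailing partial frame is
    counted as a full frame, which is a small approximation.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    pauses: List[int] = []
    run = 0  # number of consecutive quiet frames seen so far
    for start in range(0, len(samples), frame_len):
        if frame_level_db(samples[start:start + frame_len]) < threshold_db:
            run += 1
            continue
        if run * frame_ms >= min_pause_ms:
            pauses.append(run * frame_ms)
        run = 0
    if run * frame_ms >= min_pause_ms:  # pause that reaches the end
        pauses.append(run * frame_ms)
    return pauses
```

Given the returned list, `len(pauses)` gives the pause count for the section and `sum(pauses)` gives its total silence length, the two hesitation measures named in the text.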

In addition, the processor 120 may extract the speaker's hesitation feature element from statistics of the silence time calculated for each of the sections. The statistics may be the occurrence frequency of the feature element, or the sum, mean, and variance of the occurrence frequency. The statistics may also be the occurrence probability of the feature element, but are not limited thereto.
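The statistics listed above can be computed from per-section silence durations roughly as follows. This is a sketch: the function name `hesitation_statistics`, the dictionary keys, the use of population variance, and the definition of occurrence probability as the fraction of sections containing any silence are assumptions made for illustration.

```python
from statistics import mean, pvariance
from typing import Dict, Sequence


def hesitation_statistics(silence_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize per-section silence durations (ms) into feature values."""
    n = len(silence_ms)
    if n == 0:
        raise ValueError("at least one section is required")
    # Sections in which some silence was observed.
    paused = [s for s in silence_ms if s > 0]
    return {
        "frequency": len(paused),             # sections with a pause
        "total_ms": sum(silence_ms),          # total silence length
        "mean_ms": mean(silence_ms),          # mean silence per section
        "variance_ms2": pvariance(silence_ms),
        "probability": len(paused) / n,       # share of sections with a pause
    }
```

For instance, `hesitation_statistics([0, 80, 0, 40])` summarizes four sections, two of which contain silence.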

As described above, conversational response delay is extracted based on the phenomenon of a delay occurring, after another speaker has finished talking during a conversation, before the speaker in question begins to speak. Conversational response delay can therefore be calculated from the delay time before the utterance begins.

For example, through the conversational response time extraction module 240, the processor 120 may determine whether a pre-utterance delay is present, based on the time at which the first speaker begins to speak after the second speaker has finished speaking. That is, when the difference between the second speaker's utterance end time and the first speaker's utterance start time is within a predetermined value, the processor 120 may extract it as the conversational delay feature element of the first speaker.
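The time difference described above can be computed from speaker-labelled turns as in the sketch below. The `Turn` record, the function name `response_delays`, and the choice to clamp overlapping speech to a delay of zero are assumptions; the comparison against a predetermined value is deliberately left to the caller because the source does not give that value.

```python
from typing import List, NamedTuple, Sequence


class Turn(NamedTuple):
    speaker: str   # speaker label, e.g. "first" or "second"
    start: float   # utterance start time in seconds
    end: float     # utterance end time in seconds


def response_delays(turns: Sequence[Turn], target: str) -> List[float]:
    """Delays (s) between another speaker's end and the target's start.

    Only turns of `target` that directly follow a turn by a different
    speaker are considered. Overlapping speech yields a delay of 0.0.
    """
    ordered = sorted(turns, key=lambda t: t.start)
    delays: List[float] = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.speaker == target and prev.speaker != target:
            delays.append(max(0.0, cur.start - prev.end))
    return delays
```

A caller would then compare each returned delay with its own predetermined value to decide which delays count as conversational delay feature elements.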

Alternatively, conversational response delay may be extracted based on the phenomenon of a delay occurring before the speaker begins to respond to another speaker's question.

For example, through a question and response extraction module (not shown), the processor 120 may extract from the speech signal the second speaker's question and the first speaker's response to that question. Through the conversational response time extraction module 240, the processor 120 may then determine whether the first speaker's response to the second speaker's question contains a pre-utterance delay. That is, when the difference between the end time of the second speaker's question and the start time of the first speaker's corresponding response is within a predetermined value, the processor 120 may extract it as the conversational delay feature element of the first speaker.

In this way, the processor 120 may determine, through the conversational response time extraction module 240, whether each of the sections contains a pre-utterance delay of the first speaker. The processor 120 may then calculate the frequency of conversational delay for each section that contains a pre-utterance delay of the first speaker, and may extract the conversational delay frequency of each section as the first speaker's conversational delay feature element.

Alternatively, the processor 120 may calculate the length of the conversational delay time in each section and extract that length as the first speaker's conversational delay feature element.

In addition, the processor 120 may extract the first speaker's conversational delay feature element based on statistics of the conversational delay time calculated for each section. As described above, the statistics may be the occurrence frequency of the feature element, the sum, mean, and variance of the occurrence frequency, the occurrence probability, and the like, but are not limited thereto.

In this way, the processor 120 may extract the first speaker's feature elements through the hesitation time extraction module 230 or the conversational response time extraction module 240. Through the age group determination module 250, the processor 120 may then determine the speaker's age group using the feature elements extracted for the first speaker.

To determine the speaker's age group, the processor 120 may use a previously trained age group classification model 260.

For example, to generate the age group classification model 260, the processor 120 may collect a plurality of speech signals for each age group.

From the training speech signals collected for each age group, the processor 120 may extract the corresponding feature elements using the hesitation time extraction module 230 or the conversational response time extraction module 240.

Based on the feature elements extracted for each age group, the processor 120 may generate an age group classification model 260 capable of classifying age groups.

The age group classification model 260 may include a plurality of rules, each defining the feature element values for each section and the age group corresponding to those values.

For example, a rule included in the age group classification model 260 may be "IF feature element: hesitation, occurrence probability: {section 1 - 0.05, section 2 - 0.1, ..., section 16 - 0.01} THEN age group: twenties". This rule means "if the occurrence probability of the hesitation feature element is 0.05 in section 1, 0.1 in section 2, ..., and 0.01 in section 16, the age group is the twenties".

Alternatively, a rule included in the age group classification model 260 may be "IF feature element: conversational response delay, number of occurrences: 1 or more THEN age group: sixties". This rule means "if a conversational response delay occurs one or more times, the speaker's age group is the sixties".
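A rule set of this IF/THEN form can be represented and evaluated roughly as shown below. This is only a sketch: the `Rule` class, the feature keys `hesitation_prob` and `response_delay_count`, the first-match evaluation order, and the tolerance used to compare per-section probabilities are assumptions, and the example rule uses three sections instead of sixteen because the source elides the intermediate values.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Sequence

Features = Dict[str, Any]


@dataclass
class Rule:
    condition: Callable[[Features], bool]  # the IF part
    age_group: str                         # the THEN part


def sections_close_to(key: str, expected: Sequence[float],
                      tol: float = 0.01) -> Callable[[Features], bool]:
    """Condition: per-section values under `key` match `expected` within tol."""
    def condition(features: Features) -> bool:
        actual = features.get(key)
        return (actual is not None
                and len(actual) == len(expected)
                and all(abs(a - e) <= tol for a, e in zip(actual, expected)))
    return condition


def classify(features: Features, rules: Sequence[Rule]) -> Optional[str]:
    """Return the age group of the first matching rule, or None."""
    for rule in rules:
        if rule.condition(features):
            return rule.age_group
    return None


# Illustrative rules modelled on the two examples in the text.
rules = [
    Rule(sections_close_to("hesitation_prob", [0.05, 0.1, 0.01]), "twenties"),
    Rule(lambda f: f.get("response_delay_count", 0) >= 1, "sixties"),
]
```

Calling `classify({"response_delay_count": 2}, rules)` applies the second rule, while a feature set that matches no rule returns `None` so the caller can fall back to another decision method.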

The age group classification model 260 may also be generated based on pattern analysis and statistical analysis of the extracted feature elements, or based on a classification algorithm, but is not limited thereto.

The statistical analysis may be regression analysis, correlation analysis, and the like. The classification algorithm may be a neural network algorithm, a support vector machine algorithm, a k-nearest neighbor algorithm, a decision tree algorithm, and the like. The decision tree algorithm may be ID3 (Iterative Dichotomiser 3), C4.5, C5.0, CART (classification and regression tree), CHAID (chi-squared automatic interaction detector), and the like, but is not limited thereto.
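As one concrete instance of the classification algorithms listed above, a k-nearest neighbor classifier over extracted feature vectors might look like the following sketch. The function name `knn_predict`, the two-dimensional feature vector (mean pause length in ms, mean response delay in s), and the training values are all hypothetical and chosen only to make the example runnable; they are not data from the source.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, age group label)


def knn_predict(train: List[Sample], query: Sequence[float],
                k: int = 3) -> str:
    """Predict an age group by majority vote among the k nearest samples.

    Distances are plain Euclidean distances on unscaled features; in
    practice the features would likely need normalization first.
    """
    if not train:
        raise ValueError("training set is empty")
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training samples: (mean pause ms, mean response delay s).
training = [
    ((120.0, 0.30), "twenties"),
    ((130.0, 0.40), "twenties"),
    ((110.0, 0.35), "twenties"),
    ((400.0, 1.20), "sixties"),
    ((420.0, 1.50), "sixties"),
    ((390.0, 1.10), "sixties"),
]
```

A call such as `knn_predict(training, (410.0, 1.3))` then votes among the three closest training samples. A decision tree or support vector machine could be substituted without changing the surrounding feature extraction.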

The processor 120 may then store the generated age group classification model 260 in a storage module (not shown) or a database (not shown).

Next, the age group recognition method of the age group recognition apparatus 100 according to an embodiment of the present invention is described with reference to FIG. 3.

FIG. 3 is a flowchart of an age group recognition method according to an embodiment of the present invention.

The age group recognition apparatus 100 extracts speech-pattern feature elements from the speaker's speech signal (S300). The speech-pattern feature elements may be at least one of the speaker's hesitation and conversational response delay.

Hesitation can be calculated based on the duration of pauses in the speaker's utterance contained in the speech signal.

The conversation response delay may be calculated based on the difference between the utterance end time of another speaker and the utterance start time of the speaker, both included in the voice signal. For example, the conversation response delay may be extracted from a query of the other speaker included in the voice signal and the speaker's response to that query. That is, the age group recognition apparatus 100 may calculate the conversation response delay based on the difference between the end time of the other speaker's query and the start time of the speaker's response to that query.
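The delay computation can be sketched as follows, assuming the dialog has already been split into utterances labeled with a speaker identifier and start and end times in seconds. The `Utterance` tuple layout, the function names, and the choice to clamp overlapping turns to zero are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

Utterance = Tuple[str, float, float]  # (speaker_id, start_sec, end_sec)


def response_delays(utterances: List[Utterance], target: str) -> List[float]:
    """Delays between the end of another speaker's turn and the target's next turn."""
    delays = []
    ordered = sorted(utterances, key=lambda u: u[1])  # order turns by start time
    for prev, cur in zip(ordered, ordered[1:]):
        prev_speaker, _, prev_end = prev
        cur_speaker, cur_start, _ = cur
        # A response is a target turn that directly follows another speaker's turn.
        if cur_speaker == target and prev_speaker != target:
            delays.append(max(0.0, cur_start - prev_end))  # overlaps count as zero
    return delays


def mean_response_delay(utterances: List[Utterance], target: str) -> Optional[float]:
    """Average response delay of the target speaker, or None if there is none."""
    delays = response_delays(utterances, target)
    return sum(delays) / len(delays) if delays else None
```

With the turns `[("A", 0.0, 2.0), ("B", 2.5, 4.0), ("A", 4.5, 6.0), ("B", 7.0, 8.0)]`, speaker "B" responds to "A" twice, with delays of 0.5 and 1.0 seconds.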

The age group recognition apparatus 100 then determines the age group of the speaker based on the extracted feature factor and the age group classification model (S310). Here, the age group classification model is generated based on a plurality of voice signals collected for each age group.

Meanwhile, the age group recognition apparatus 100 may extract feature factors for each age group from the plurality of voice signals collected for each age group. The age group recognition apparatus 100 may then generate the age group classification model using the feature factors extracted for each age group.
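Model generation and age group determination can be illustrated with a small k-nearest neighbor classifier, one of the classification algorithms named in this description. This is only a sketch under stated assumptions: each training sample is a hypothetical two-element feature vector (hesitation, conversation response delay), the group labels and numeric values are made-up placeholders rather than measured data, and `build_model` and `classify` are illustrative names.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Tuple[float, ...], str]  # (feature vector, age group label)


def build_model(samples_by_group: Dict[str, List[Sequence[float]]]) -> List[Sample]:
    """Flatten per-age-group feature vectors into one labeled training set."""
    return [(tuple(features), group)
            for group, vectors in samples_by_group.items()
            for features in vectors]


def classify(model: List[Sample], features: Sequence[float], k: int = 3) -> str:
    """Return the majority age group among the k nearest training samples."""
    nearest = sorted(model, key=lambda s: math.dist(s[0], features))[:k]
    votes = Counter(group for _, group in nearest)
    return votes.most_common(1)[0][0]


# Placeholder feature vectors: (hesitation, response delay in seconds).
training = {
    "group_a": [(0.30, 1.2), (0.28, 1.0), (0.35, 1.4)],
    "group_b": [(0.10, 0.4), (0.12, 0.5), (0.08, 0.3)],
}
model = build_model(training)
predicted = classify(model, (0.11, 0.45))
```

In practice the features would usually be scaled to comparable ranges before computing distances, and any of the other listed algorithms could replace the k-nearest neighbor vote.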

In addition, the voice signal of the speaker may be collected through the input module 130. The age group recognition apparatus 100 can therefore receive the voice signal of the speaker and determine the age group of the speaker for the received voice signal.

The age group recognition apparatus 100 and method according to an embodiment of the present invention can recognize the age group of a user by using an age group classification model generated based on speech morphological feature factors, such as hesitation and conversation response delay, extracted from the voice signal of a speaker. The age group recognition apparatus 100 and method can therefore determine the age group of the speaker based on the speech morphological feature factors analyzed for each age group, without any personal information stored in advance.

In addition, because the age group recognition apparatus 100 and method can automatically extract and recognize demographic information from the voice of a speaker, they can be used for personalized services on various electronic devices.

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. The computer-readable medium may also include computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

Although the method and system of the present invention have been described in connection with specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The foregoing description of the present invention is provided for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: Age group recognition apparatus
110: Memory
120: Processor
130: Input module

Claims

1. An apparatus for recognizing an age group of a speaker, comprising:
a memory storing a program for recognizing the age group of the speaker; and
a processor executing the program stored in the memory,
wherein the processor, upon executing the program, extracts a speech morphological feature factor of the speaker from a voice signal of the speaker and determines the age group of the speaker based on the extracted speech morphological feature factor and an age group classification model,
wherein the age group classification model is generated based on a plurality of voice signals collected for each age group, and
wherein the speech morphological feature factor includes at least one of hesitation of the speaker and a conversation response delay.

2. The apparatus according to claim 1,
wherein the processor calculates the hesitation based on a pause time during the speaker's utterance included in the voice signal.

3. The apparatus according to claim 1,
wherein the processor calculates the conversation response delay based on a difference between an utterance end time of another speaker included in the voice signal and an utterance start time of the speaker.

4. The apparatus according to claim 3,
wherein the processor extracts a query of the other speaker included in the voice signal and a response of the speaker corresponding to the query,
and calculates the conversation response delay based on the end time of the query and the start time of the response.

5. The apparatus according to claim 1,
wherein the processor extracts feature factors for each age group from the plurality of voice signals collected for each age group and generates the age group classification model using the feature factors extracted for each age group.

6. The apparatus according to claim 1,
further comprising an input module for receiving a voice signal,
wherein the processor determines the age group of the speaker for the voice signal of the speaker received through the input module.

7. A method for recognizing an age group of a speaker, performed by an age group recognition apparatus, the method comprising:
extracting a speech morphological feature factor of the speaker from a voice signal of the speaker; and
determining the age group of the speaker based on the extracted speech morphological feature factor and an age group classification model,
wherein the age group classification model is generated based on a plurality of voice signals collected for each age group, and
wherein the speech morphological feature factor includes at least one of hesitation of the speaker and a conversation response delay.

8. The method of claim 7,
wherein the extracting of the speech morphological feature factor comprises
calculating the hesitation based on a pause time during the speaker's utterance included in the voice signal.

9. The method of claim 7,
wherein the extracting of the speech morphological feature factor comprises
calculating the conversation response delay based on a difference between an utterance end time of another speaker included in the voice signal and an utterance start time of the speaker.

10. The method of claim 9,
wherein the extracting of the speech morphological feature factor comprises:
extracting a query of the other speaker included in the voice signal and a response of the speaker corresponding to the query; and
calculating the conversation response delay based on the end time of the query and the start time of the response.

11. The method of claim 7,
further comprising, before the extracting of the speech morphological feature factor:
extracting feature factors for each age group based on the plurality of voice signals collected for each age group; and
generating the age group classification model using the feature factors extracted for each age group.

12. The method of claim 7,
further comprising, before the extracting of the speech morphological feature factor,
receiving the voice signal of the speaker through an input module.