KR20210000802A

KR20210000802A - Artificial intelligence voice recognition processing method and system

Info

Publication number: KR20210000802A
Application number: KR1020190075833A
Authority: KR
Inventors: 황호연
Original assignee: 희성전자 주식회사
Priority date: 2019-06-25
Filing date: 2019-06-25
Publication date: 2021-01-06

Abstract

The present invention relates to an artificial intelligence voice recognition method and system capable of responding automatically in a voice of the opposite sex to the user, based on gender recognition of the user's voice. More specifically, the artificial intelligence voice recognition method comprises the steps of: detecting a voice of a user; analyzing the meaning of the voice and the gender and age of the user from the detected voice; automatically selecting a preset response mode according to the analyzed gender and age of the user; and outputting a response corresponding to the meaning of the voice in the selected response mode. When the response mode is selected, a voice of the opposite sex to the user is selected.

Description

Artificial intelligence speech recognition processing method and system {ARTIFICIAL INTELLIGENCE VOICE RECOGNITION PROCESSING METHOD AND SYSTEM}

The present invention relates to an artificial intelligence speech recognition processing method and system, and in particular to a method and system configured to respond automatically in a voice of the opposite sex to the user, based on gender recognition of the user's voice.

In general, the response (assistant) voice produced by artificial intelligence (AI) speech recognition is in most cases a female voice.

Artificial intelligence has no inherent gender, but gender is attributed to it through its name or voice. Most voice recognition assistants have female names such as "Siri", "Cortana", and "Alexa". By contrast, AI systems for legal work or other high-level tasks have male names such as "Watson" and "Ross".

AI starts out without a gender, but it can acquire one depending on what data is accumulated and how the algorithm is constructed.

The predominance of female voices among such AI assistants draws on some research findings that female voices are perceived as more comfortable than male voices. Others, however, argue that it reflects stereotypes about gender roles, and it has emerged as a problem of social gender discrimination and reinforcement of stereotypes.

In other words, using female names and voices for AI voice assistants reinforces such stereotypes even when no harm is intended, and the predominant use of intelligent, authoritative male voices in the medical, legal, and quiz fields further encourages gender-based differentiation.

At present, manufacturers decide the name (wake word) and gender of an AI voice assistant before releasing a product, and the voice is likewise fixed to the chosen gender. As a result, products fail to offer an AI voice assistant voice that users of different genders and ages find likeable.

FIG. 1 shows a configuration diagram of a conventional, general artificial intelligence speech recognition system.

Most speech recognition systems have the configuration shown in FIG. 1. From the input signal (11) of the speaker (10), only the speech signal actually uttered by a person is detected; features are extracted (30); patterns are classified (50) by measuring similarity to a reference acoustic model (40); and the result is processed as language (70) based on a language model (60) to recognize the final sentence.

Such technology focuses only on reducing speech recognition errors. It consists of a system that uses the recognized sentence to generate and synthesize language stored in a dialogue-processing database and outputs it as speech through a speaker. That is, feature extraction (30) merely recognizes the speaker's gender, and the pattern classification and language processing algorithms focus solely on raising the speech recognition rate.

(0001) Korean Laid-Open Patent Publication No. 10-2019-0026518
(0002) Korean Registered Patent No. 10-0806025

The technical problem to be solved by the present invention is to provide an artificial intelligence speech recognition processing method and system configured to respond in a voice of a different gender (the opposite sex) from the user, based on the gender identified from the user's voice.

Another technical problem to be solved by the present invention is to provide an artificial intelligence speech recognition processing method and system configured to respond in a voice of an age group that the user finds likeable, according to the user's age group as well as gender.

To achieve the above technical problem, the artificial intelligence speech recognition processing method of the present invention comprises: detecting a user's voice; analyzing the meaning of the speech and the user's gender and age group from the detected voice; automatically selecting a preset response mode according to the analyzed gender and age group of the user; and outputting a response corresponding to the meaning of the speech in the selected response mode, wherein a voice of the opposite sex to the user is selected when the response mode is selected.

In addition, the artificial intelligence speech recognition processing system of the present invention comprises: a voice detection unit that detects a user's voice; a voice analysis unit that analyzes the meaning of the speech and the user's gender and age group from the voice detected by the voice detection unit; a voice processing unit that automatically selects a preset response mode according to the gender and age group analyzed by the voice analysis unit; and a voice output unit that outputs a response corresponding to the meaning of the speech in the selected response mode, wherein a voice of the opposite sex to the user is selected when the response mode is selected.

Here, the user's gender is analyzed from the vocal cord vibration frequency and jitter.

The user's age is analyzed from shimmer and the noise-to-harmonics ratio (NHR).

The analysis of the user's gender and age group uses at least one of a linear prediction coefficient method, a cepstrum method, a mel-frequency cepstrum method, a per-frequency-band energy spectrum method, a Gaussian mixture model, a neural network model, a support vector machine, and a hidden Markov model.

In the response mode, when the user's voice is gender-neutral, the response is given in a gender-neutral voice.

In addition, the response is given in a voice of a younger age group than the user's.

The present invention described above has the following effects.

First, by responding in a voice of the opposite sex to the user, it can increase the user's sense of liking and closeness.

In addition, by providing a voice of a younger age group than the user's, it can further increase that sense of liking and closeness.

Furthermore, the configuration described above can alleviate the issues of social gender discrimination and gender stereotyping found in the prior art.

FIG. 1 is a configuration diagram of a conventional, general artificial intelligence speech recognition system;
FIG. 2 is a configuration diagram of an artificial intelligence speech recognition processing system according to an embodiment of the present invention;
FIG. 3 is a diagram showing the vocal cord vibration frequency by gender and age according to the present invention;
FIG. 4 is a diagram showing jitter by gender and age according to the present invention;
FIG. 5 is a diagram showing shimmer by gender and age according to the present invention;
FIG. 6 is a diagram showing NHR by gender and age according to the present invention;
FIG. 7 is a configuration diagram of an artificial intelligence speech recognition processing method according to an embodiment of the present invention;
FIG. 8 is a diagram showing a manual voice assistant selection method according to the artificial intelligence speech recognition processing method of the present invention; and
FIG. 9 is a diagram showing an automatic voice assistant selection method according to the artificial intelligence speech recognition processing method of the present invention.

Hereinafter, some embodiments of the present invention are described in detail with reference to the exemplary drawings. In assigning reference numerals to the elements of each drawing, the same elements are given the same numerals wherever possible, even when they appear in different drawings. In describing the embodiments, detailed descriptions of related known configurations or functions are omitted where they would obscure understanding of the embodiments.

In describing the elements of the embodiments, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one element from another and do not limit the nature, sequence, or order of the elements. When an element is described as being "connected", "coupled", or "joined" to another element, it may be directly connected or joined to that element, but it should be understood that a further element may also be "connected", "coupled", or "joined" between them.

First, the present invention can be applied to any publicly known artificial intelligence speech recognition processing system. Since artificial intelligence itself is a known technology, a detailed description of it is omitted.

FIG. 2 is a configuration diagram of an artificial intelligence speech recognition processing system according to an embodiment of the present invention. FIG. 3 shows the vocal cord vibration frequency by gender and age, FIG. 4 shows jitter by gender and age, FIG. 5 shows shimmer by gender and age, and FIG. 6 shows NHR by gender and age, each according to the present invention. The artificial intelligence speech recognition processing system of the present invention is described below with reference to FIGS. 2 to 6.

Referring to FIG. 2, the artificial intelligence (AI) speech recognition processing system according to the present invention comprises a voice detection unit (100), a voice analysis unit (200), a voice processing unit (300), and a voice output unit (400).

The voice detection unit (100) detects only the voice of the user (speaker) from the input received through a microphone or the like. Voice detection is a technology widely used for speech recognition, and it also plays an important role as a means of operating machines and devices by human voice.

Speech signal recognition technologies are broadly divided into speech recognition and speaker recognition. Speech recognition is further divided into speaker-dependent systems, which recognize only a specific speaker, and speaker-independent systems, which recognize speech regardless of the speaker.

Speaker-dependent speech recognition stores and registers the user's voice before use, and at recognition time compares the pattern of the input voice with the stored voice pattern.

When the speaker's (user's) voice signal is received through a microphone or the like, only the speech portion must be detected, and this speech detection strongly affects recognition performance. In noisy environments, noise is often included in the detected interval of the speaker's voice signal, so it is preferable to provide the voice detection unit (100) in order to raise the speech recognition rate.
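As a minimal sketch of what such a voice detection step might look like, the following uses a simple frame-energy threshold. The patent does not specify a detection algorithm; the function names, the frame length, the threshold ratio, and the assumption that the first few frames contain only background noise are all illustrative.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech_frames(samples, frame_len=160, threshold_ratio=4.0, noise_frames=5):
    """Return one boolean per frame: True where the frame is judged to be speech.

    Assumption: the first `noise_frames` frames hold background noise only,
    so their average energy serves as the noise floor.
    """
    energies = frame_energies(samples, frame_len)
    if not energies:
        return []
    lead = energies[:noise_frames]
    noise_floor = max(sum(lead) / len(lead), 1e-12)  # avoid a zero threshold
    return [e > threshold_ratio * noise_floor for e in energies]
```

Only the frames flagged `True` would be passed on to the analysis stage; a practical system would likely add smoothing so that short pauses inside an utterance are not cut out.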

The voice analysis unit (200) determines the meaning of the speech from the user's voice information detected by the voice detection unit (100), and analyzes the user's gender and age group. Determining the meaning of the speech means identifying what the user intends, which can be done through natural language processing. Natural language processing (NLP) refers to technology that converts human language into a machine-readable form through various analysis methods so that a computer can understand it. It also includes technology for converting the result back into a form that humans can interpret. Natural language processing is a subfield of artificial intelligence: a discipline that emerged as techniques for analyzing and understanding human language became specialized after the attempts to build artificial intelligence in the 1960s failed, and a research area of language engineering, artificial intelligence, and computational linguistics. Natural language refers to languages such as Korean or English that arose naturally over a long period and have been used for communication, as opposed to languages artificially created by people, such as programming languages. Because in engineering the word "language" usually brings to mind programming languages such as C or Java, the languages people use are distinguished by calling them natural languages. Since the technology for determining the meaning of speech is itself well known, a description of the specific methodology is omitted.

One example of an artificial intelligence device capable of natural language processing is the chatbot. "Chatbot" is literally a compound of "chatting" and "robot", and can mean a robot that converses (chats) like a person. A chatbot may be a home device such as Google Home, Alexa, or Siri, manufactured and sold by Google, Amazon, Apple, and others, or it may be an enterprise device used for tasks such as customer service.

To analyze the user's gender and age group, the voice analysis unit (200) may include a feature vector calculation unit, a gender detection module unit, and an age detection module unit.

Feature vector calculation, gender detection, and age detection may use at least one of a linear prediction coefficient method, a cepstrum method, a mel-frequency cepstrum (MFCC) method, a per-frequency-band energy spectrum method, a Gaussian mixture model (GMM), a neural network model (NNM), a support vector machine (SVM), and a hidden Markov model (HMM). Since these methods and models are also well known, a detailed description is omitted.

Referring to FIG. 3, 'Fo' denotes the vocal cord vibration frequency (the fundamental frequency), which perceptually corresponds to pitch. Jitter, shown in FIG. 4, is a measure of how constant the period of the vibration is.
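One common way to estimate Fo from a short voiced frame is autocorrelation peak picking. The patent does not say how Fo is measured, so the sketch below is only an assumed illustration; `estimate_f0`, the 60 to 400 Hz search range, and the synthetic sine signal are hypothetical choices.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    Picks the lag with the largest unnormalized autocorrelation inside the
    lag range implied by [f0_min, f0_max].
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    if len(samples) <= lag_max:
        raise ValueError("frame too short for the requested F0 range")

    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def sine(freq_hz, sample_rate, n_samples):
    """Synthetic test tone standing in for a voiced frame."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n_samples)]
```

For example, `estimate_f0(sine(200.0, 8000, 400), 8000)` lands on the 40-sample lag that corresponds to 200 Hz. Real speech is far less clean than a sine wave, so a practical estimator would add voicing detection and octave-error handling.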

Shimmer, shown in FIG. 5, is a measure of how constant the amplitude of the vibration is; the more irregular the period or amplitude, the larger the shimmer value. The noise-to-harmonics ratio (NHR), shown in FIG. 6, is the average ratio of the inharmonic components between 1,500 and 4,500 Hz to the harmonic components between 70 and 4,500 Hz; the larger the value, the higher the proportion of noise.
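The patent gives no formulas for jitter or shimmer. A widely used "local" definition takes the mean absolute difference between consecutive cycles, divided by the mean value, expressed as a percentage. The sketch below assumes that definition and assumes the per-cycle periods and peak amplitudes have already been measured; the function names are illustrative.

```python
def local_perturbation_percent(values):
    """Mean absolute cycle-to-cycle difference relative to the mean, in percent."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    mean_abs_diff = sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_abs_diff / mean_value


def jitter_percent(periods_s):
    """Local jitter from a sequence of glottal cycle durations (seconds)."""
    return local_perturbation_percent(periods_s)


def shimmer_percent(peak_amplitudes):
    """Local shimmer from a sequence of per-cycle peak amplitudes."""
    return local_perturbation_percent(peak_amplitudes)
```

A perfectly regular voice yields 0 % for both measures; larger values indicate more cycle-to-cycle irregularity.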

Referring to FIGS. 3 and 4, the features that differ significantly by speaker gender are the vocal cord vibration frequency 'Fo' and jitter, as follows.

- Vibration frequency Fo: male 119.02 ± 22.71 Hz, female 199.60 ± 26.93 Hz

- Jitter: male 0.24 ± 0.15 %, female 0.14 ± 0.11 %

Likewise, referring to FIGS. 5 and 6, the features that differ significantly by speaker age are shimmer and NHR, as follows.

- Shimmer: male 6.05 ± 5.16 %, female 5.90 ± 4.69 %

- NHR: male 0.0192 ± 0.02, female 0.013 ± 0.01

Based on these feature differences, the speaker's characteristics can be determined using threshold values that separate gender and age.
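The text calls for threshold values but does not state them. As a hedged illustration, the sketch below instead compares Gaussian log-likelihoods built from the male and female Fo and jitter statistics quoted above, and returns "neutral" when the two scores are close. The Gaussian model, the treatment of the two features as independent, and the `neutral_margin` value are assumptions, not part of the patent.

```python
import math

# Mean and standard deviation per gender, taken from the figures quoted in the text.
GENDER_STATS = {
    "male": {"f0": (119.02, 22.71), "jitter": (0.24, 0.15)},
    "female": {"f0": (199.60, 26.93), "jitter": (0.14, 0.11)},
}


def gaussian_log_density(x, mean, std):
    """Log of the normal probability density at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def classify_gender(f0_hz, jitter_pct, neutral_margin=1.0):
    """Return "male", "female", or "neutral" for one utterance.

    Assumption: when the two log-likelihoods differ by less than
    `neutral_margin`, the voice is treated as gender-neutral.
    """
    scores = {
        gender: gaussian_log_density(f0_hz, *stats["f0"])
        + gaussian_log_density(jitter_pct, *stats["jitter"])
        for gender, stats in GENDER_STATS.items()
    }
    if abs(scores["male"] - scores["female"]) < neutral_margin:
        return "neutral"
    return max(scores, key=scores.get)
```

Because the two Fo distributions are well separated, Fo dominates the decision; jitter contributes little and mostly matters near the boundary. An age classifier is not sketched here because the text reports shimmer and NHR statistics by gender only, not by age band.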

The voice processing unit (300) automatically selects a preset response mode according to the user's gender and age group as analyzed by the voice analysis unit (200). The response mode is configured to select a voice of the opposite gender to the user, a gender-neutral voice when the user's voice is gender-neutral, and a voice of a younger age group than the user's.

That is, the system responds in a female voice if the user is male, in a male voice if the user is female, and in a gender-neutral voice if the user's voice is gender-neutral. In addition, it can be configured to respond in a voice in its 20s if the user is in their 20s or 30s, in a voice in its 30s if the user is in their 40s or 50s, and in a teenage voice if the user is in their 60s. When the user's voice is gender-neutral, a gender-neutral voice in its 20s can be selected by default.
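The selection rules in this paragraph can be sketched as a small lookup function. The function name and the representation of age as a decade (30 for "thirties") are illustrative. The text does not say what happens for users under 20, so the teenage fallback in that branch is an assumption.

```python
OPPOSITE_GENDER = {"male": "female", "female": "male"}


def select_response_mode(user_gender, user_age_decade):
    """Return (voice_gender, voice_age_decade) for the assistant's reply."""
    if user_gender == "neutral":
        # Gender-neutral users get the default gender-neutral voice in its 20s.
        return ("neutral", 20)
    if user_gender not in OPPOSITE_GENDER:
        raise ValueError(f"unknown gender: {user_gender!r}")

    if user_age_decade >= 60:
        voice_age = 10  # users in their 60s (or older): teenage voice
    elif user_age_decade >= 40:
        voice_age = 30  # users in their 40s or 50s: voice in its 30s
    elif user_age_decade >= 20:
        voice_age = 20  # users in their 20s or 30s: voice in its 20s
    else:
        voice_age = 10  # assumption: case not covered by the text
    return (OPPOSITE_GENDER[user_gender], voice_age)
```

For instance, a male user in his 50s would be answered by a female voice in its 30s.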

The voice output unit (400) outputs, through a speaker or the like, a response corresponding to the meaning of the speech in the response mode selected by the voice processing unit (300).

In short, the core of the artificial intelligence speech recognition processing system according to the present invention is that it identifies the user's gender and age group from the user's voice and automatically outputs a voice of a different gender from the user's, and of a younger age group than the user's. This is based on the observation that men are attracted to clear voices in the high-pitched range, while women feel attraction and closeness toward male voices in the mid-to-low range.

FIG. 7 is a configuration diagram of an artificial intelligence speech recognition processing method according to an embodiment of the present invention, FIG. 8 shows a manual voice assistant selection method according to that processing method, and FIG. 9 shows an automatic voice assistant selection method according to that processing method.

First, the artificial intelligence speech recognition processing method of the present invention is described with reference to FIG. 7.

The artificial intelligence speech recognition processing method of the present invention comprises a step of detecting a voice (S100), a step of analyzing the user's gender and age group (S200), a step of automatically selecting a response mode (S300), and a step of outputting a response (S400).

The step of detecting a voice (S100) corresponds to the description of the voice detection unit (100) above, and the step of analyzing the user's gender and age group (S200) corresponds to the description of the voice analysis unit (200). Likewise, the step of automatically selecting a response mode (S300) corresponds to the description of the voice processing unit (300), and the step of outputting a response (S400) corresponds to the description of the voice output unit (400), so detailed descriptions are omitted.
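To make the flow of steps S100 to S400 concrete, the following sketch wires four stages together. It is only a structural illustration: the stage implementations are passed in as callables, and `run_pipeline` and its parameter names are hypothetical rather than taken from the patent.

```python
def run_pipeline(audio, detect, analyze, select, output):
    """Run one utterance through the four steps of the method.

    detect(audio)              -> speech portion of the input        (S100)
    analyze(speech)            -> (meaning, gender, age_decade)      (S200)
    select(gender, age_decade) -> response mode                      (S300)
    output(meaning, mode)      -> rendered response                  (S400)
    """
    speech = detect(audio)
    meaning, gender, age_decade = analyze(speech)
    mode = select(gender, age_decade)
    return output(meaning, mode)
```

Keeping the stages as separate callables mirrors the four units (100 to 400) of the system and lets each one be replaced or tested on its own.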

The artificial intelligence speech recognition processing method according to the present invention may operate in a manual mode or an automatic mode.

The algorithm of the manual mode of the artificial intelligence speech recognition processing method is described below with reference to FIG. 8.

First, when the physical button on the product hardware (the assistant gender selection button) is pressed, the user decides whether to select the assistant manually. At this point, a spoken guide offering the three assistant gender options can be played so that the user can choose.

For example, from the AI voice assistant voices registered in advance in the database (DB), the voices in their 20s are played through the speaker for the user to choose from. Once the manual selection is complete, the results of the user's commands are output through the speaker in the AI voice assistant voice the user selected, regardless of the gender of the user's voice (wake word utterance).
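The effect of the manual mode, where a stored user choice takes precedence over whatever the automatic analysis would pick, can be sketched as follows. The class and method names are hypothetical, and representing a voice as a `(gender, age_decade)` pair is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Voice = Tuple[str, int]  # (gender, age decade), e.g. ("female", 20)


@dataclass
class AssistantSettings:
    """Holds the user's manual voice choice, if one has been made."""
    manual_voice: Optional[Voice] = None

    def choose_manually(self, gender: str) -> None:
        # The text offers three options, each played as a voice in its 20s.
        if gender not in ("male", "female", "neutral"):
            raise ValueError(f"unknown option: {gender!r}")
        self.manual_voice = (gender, 20)

    def resolve(self, auto_voice: Voice) -> Voice:
        # A manual choice overrides the automatically selected voice.
        return self.manual_voice if self.manual_voice is not None else auto_voice
```

Until `choose_manually` is called, `resolve` simply returns the automatically selected voice.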

Next, the algorithm of the automatic mode of the artificial intelligence speech recognition processing method is described with reference to FIG. 9.

When the user utters the AI voice assistant wake word, the 'gender detection module unit' detects the user's gender from the voice of the user's spoken command, regardless of the gender associated with the wake word.

First, gender is classified by extracting features of the speaker's voice with reference to the gender reference values for the vocal cord vibration frequency (Fo) and jitter, and the classified gender is stored in a 'gender recognition register'.

Second, to classify the age group, the age detection module unit extracts the speaker's features with reference to the age-group threshold values for the shimmer and NHR of the speaker's wake-word utterance, classifies the age group, and stores it in an 'age recognition register'.

Once the speaker's gender and age group have been classified, the voice processing unit (300) selects and outputs a voice using the database (DB) of seven pre-stored voice modes.

That is, when the speaker's voice is male, the system responds with an opposite-sex, female AI voice assistant voice. If the speaker is in their 30s, a female AI voice assistant in its 20s is used; if the speaker is in their 40s or 50s, a female AI voice assistant in its 30s; and if the speaker is 60 or older, a female teenage AI voice assistant.

Likewise, when the speaker's voice is female, the system responds with an opposite-sex, male AI voice assistant voice. If the speaker is in their 30s, a male AI voice assistant in its 20s is used; if the speaker is in their 40s or 50s, a male AI voice assistant in its 30s; and if the speaker is 60 or older, a male teenage AI voice assistant.

When the speaker's voice is gender-neutral, a gender-neutral voice in its 20s is used by default.
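The text mentions seven pre-stored voice modes without listing them. One reading consistent with the mappings above is three age bands (teens, 20s, 30s) for each of the male and female voices, plus one gender-neutral voice in its 20s. The sketch below builds such a table under that assumption; the keys and the asset names such as `female_20s` are hypothetical.

```python
GENDERED_VOICE_AGES = (10, 20, 30)

# Assumed layout: 2 genders x 3 age bands + 1 gender-neutral voice = 7 modes.
VOICE_DB = {
    (gender, age): f"{gender}_{age}s"
    for gender in ("male", "female")
    for age in GENDERED_VOICE_AGES
}
VOICE_DB[("neutral", 20)] = "neutral_20s"


def lookup_voice(gender, voice_age):
    """Return the stored voice asset name for a selected response mode."""
    try:
        return VOICE_DB[(gender, voice_age)]
    except KeyError:
        raise ValueError(f"no stored voice for {(gender, voice_age)!r}") from None
```

Raising an error for combinations outside the table makes it obvious when the selection logic asks for a voice that was never recorded.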

According to the present invention described above, the voice of the user (speaker) is analyzed automatically and the user's gender and age group are identified from the analyzed voice information. A voice of a different gender from the user's is then output, or a gender-neutral voice if the user's voice is gender-neutral, in each case of a younger age group than the user's. This can increase the user's sense of liking and closeness toward the voice of the AI voice assistant, and can alleviate the issues of social gender discrimination and gender stereotyping found in the prior art.

In the above, even though all the components constituting the embodiments of the present invention have been described as being combined into one or as operating in combination, the present invention is not necessarily limited to these embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined and operated. In addition, terms such as "include", "comprise", or "have" used above mean that the corresponding component may be present unless specifically stated otherwise, and should therefore be interpreted as allowing other components to be further included rather than excluding them. All terms, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs, unless otherwise defined. Commonly used terms, such as those defined in dictionaries, should be interpreted as consistent with their meaning in the context of the related art and, unless explicitly defined in the present invention, are not to be interpreted in an idealized or excessively formal sense.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention belongs will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be interpreted as falling within the scope of rights of the present invention.

100: voice detection unit 200: voice analysis unit
300: voice processing unit 400: voice output unit
S100: detecting a voice
S200: analyzing the user's gender and age group
S300: automatically selecting a response mode
S400: outputting a response
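As a rough illustration of how units 100 to 400 and steps S100 to S400 fit together, the following sketch chains four callables in order. All names (`Analysis`, `VoiceAssistantPipeline`, `handle`) and the stub behaviors are hypothetical and chosen only for this example; the actual detection, analysis, and synthesis components are not specified here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Analysis:
    """Result of step S200 (field names are assumptions for this sketch)."""
    meaning: str
    gender: str
    age_band: str


@dataclass
class VoiceAssistantPipeline:
    detect: Callable[[bytes], bytes]         # unit 100, step S100
    analyze: Callable[[bytes], Analysis]     # unit 200, step S200
    select_mode: Callable[[str, str], str]   # unit 300, step S300
    output: Callable[[str, str], str]        # unit 400, step S400

    def handle(self, audio: bytes) -> str:
        voice = self.detect(audio)                                # S100
        result = self.analyze(voice)                              # S200
        mode = self.select_mode(result.gender, result.age_band)   # S300
        return self.output(result.meaning, mode)                  # S400


# Stub components standing in for real detection, analysis, and synthesis.
pipeline = VoiceAssistantPipeline(
    detect=lambda audio: audio,
    analyze=lambda voice: Analysis("weather", "male", "30s"),
    select_mode=lambda gender, age_band: "female-20s",
    output=lambda meaning, mode: f"[{mode}] reply to '{meaning}'",
)
print(pipeline.handle(b"\x00\x01"))
```

Keeping each unit behind a plain callable makes it possible to swap the analysis or mode-selection logic without touching the other stages.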

Claims

An artificial intelligence speech recognition processing method comprising:
detecting a user's voice;
analyzing the meaning of the voice and the user's gender and age group from the detected voice;
automatically selecting a preset response mode according to the analyzed gender and age group of the user; and
outputting a response corresponding to the meaning of the voice in the selected response mode,
wherein, when the response mode is selected, a voice of the opposite sex to the user's gender is selected.

The method of claim 1,
wherein the user's gender is analyzed through the vocal cord vibration frequency and jitter.

The method of claim 1,
wherein the user's age group is analyzed through shimmer and the noise-to-harmonics ratio (NHR).

The method of claim 1,
wherein the user's gender and age group are analyzed using at least one of a linear prediction coefficient method, a cepstrum method, a Mel-frequency cepstrum method, a per-frequency-band energy spectrum method, a Gaussian mixture model, a neural network model, a support vector machine, and a hidden Markov model.

The method of claim 1,
wherein, in the response mode, the response is given in a neutral voice when the user's voice is neutral.

The method of claim 5,
wherein the response is given in a voice of an age group lower than the user's age group.

An artificial intelligence speech recognition processing system comprising:
a voice detection unit that detects a user's voice;
a voice analysis unit that analyzes the meaning of the voice and the user's gender and age group from the user's voice detected by the voice detection unit;
a voice processing unit that automatically selects a preset response mode according to the gender and age group of the user analyzed by the voice analysis unit; and
a voice output unit that outputs a response corresponding to the meaning of the voice in the selected response mode,
wherein, when the response mode is selected, a voice of the opposite sex to the user's gender is selected.

The system of claim 7,
wherein the user's gender is analyzed through the vocal cord vibration frequency and jitter.

The system of claim 7,
wherein the user's age group is analyzed through shimmer and the noise-to-harmonics ratio (NHR).

The system of claim 7,
wherein the user's gender and age group are analyzed using at least one of a linear prediction coefficient method, a cepstrum method, a Mel-frequency cepstrum method, a per-frequency-band energy spectrum method, a Gaussian mixture model, a neural network model, a support vector machine, and a hidden Markov model.

The system of claim 7,
wherein, in the response mode, the response is given in a neutral voice when the user's voice is neutral.

The system of claim 11,
wherein the response is given in a voice of an age group lower than the user's age group.
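The claims name jitter and shimmer as analysis features without defining them. As a hedged illustration, the sketch below uses the common "local" definitions from voice analysis: the mean absolute difference between consecutive glottal cycles, divided by the mean value, applied to cycle periods for jitter and to cycle peak amplitudes for shimmer. The function names and the assumption that per-cycle periods and amplitudes have already been extracted are illustrative; the document does not specify which variant of these measures is used.

```python
from typing import Sequence


def _relative_mean_abs_diff(values: Sequence[float]) -> float:
    """Mean absolute difference of consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_value = sum(values) / len(values)
    return (sum(diffs) / len(diffs)) / mean_value


def local_jitter(periods: Sequence[float]) -> float:
    """Cycle-to-cycle variation of the pitch period (periods in seconds)."""
    return _relative_mean_abs_diff(periods)


def local_shimmer(amplitudes: Sequence[float]) -> float:
    """Cycle-to-cycle variation of the peak amplitude."""
    return _relative_mean_abs_diff(amplitudes)
```

A perfectly periodic signal gives a jitter and shimmer of zero; larger values indicate a less stable voice source, which is why such measures are often used as inputs to speaker attribute classifiers.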