KR102444003B1 - Apparatus for authenticating user based on voice image and method thereof

Info

Publication number: KR102444003B1
Application number: KR1020200165034A
Authority: KR
Inventors: 이승룡; 우바이드 유알 레만
Original assignee: 경희대학교 산학협력단
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2022-09-15
Also published as: KR20220076160A

Abstract

A voice image-based user authentication apparatus and a method thereof are disclosed.
The apparatus includes a voice feature extraction unit that extracts voice features from a voice input by a user. The apparatus further includes a voice image conversion unit that generates a voice image corresponding to the voice features extracted by the voice feature extraction unit, and a storage unit that stores the voice image generated by the voice image conversion unit. The apparatus further includes an adaptive deep learning unit that performs deep learning on the voice images stored in the storage unit to generate a corresponding voice model. The apparatus further includes an identity verification unit that generates a first authentication result by performing user authentication through comparison of the voice image generated by the voice image conversion unit with the per-user voice images stored in the storage unit, generates a second authentication result by performing user authentication through application of the voice model generated by the adaptive deep learning unit to the voice image generated by the voice image conversion unit, and then generates a final authentication result using the first authentication result and the second authentication result.

Description

Voice image-based user authentication device and method {APPARATUS FOR AUTHENTICATING USER BASED ON VOICE IMAGE AND METHOD THEREOF}

The present invention relates to a voice image-based user authentication apparatus and a method thereof.

Along with new technology trends, virtual assistants help people complete everyday tasks more efficiently. Most virtual assistants use artificial intelligence and provide users with personalized assistance such as calendar management, appointment scheduling, wake-up calls, and various other everyday services.

Because of these advantages, many applications in different domains, such as TVs, mobile devices, vehicles, and the Internet of Things, now have their own built-in virtual assistants. These virtual assistants go by various names, such as virtual agent, dialogue manager, conversational agent, or conversational assistant. Many well-known companies (for example, Amazon, Apple, Samsung, and Google) have launched their own virtual assistants.

For general virtual assistants, and more importantly for virtual assistants in specialized fields, for example a virtual assistant used to predict disease from a patient's physical symptoms, it is of the utmost importance to accurately identify, based on voice, the user requesting help and to manage the session accordingly.

However, the voice-based authentication used in conventional virtual assistants is performed only by extracting features from the user's voice and determining their similarity to existing user voices, and it therefore suffers from low accuracy of user identification.

Accordingly, there is a need for a more efficient and improved method of user identification through voice-based authentication.

An object of the present invention is to provide a voice image-based user authentication apparatus and a method thereof that enable improved user identification over the prior art through more efficient and accurate voice-based authentication.

The characteristic configuration of the present invention for achieving the above object and realizing the characteristic effects of the present invention described below is as follows.

According to one aspect of the present invention, a voice image-based user authentication apparatus is provided, the apparatus including:

a voice feature extraction unit that extracts voice features from a voice input by a user; a voice image conversion unit that generates a voice image corresponding to the voice features extracted by the voice feature extraction unit; a storage unit that stores the voice image generated by the voice image conversion unit; an adaptive deep learning unit that performs deep learning on the voice images stored in the storage unit to generate a corresponding voice model; and an identity verification unit that generates a first authentication result by performing user authentication through comparison of the voice image generated by the voice image conversion unit with the per-user voice images stored in the storage unit, generates a second authentication result by performing user authentication through application of the voice model generated by the adaptive deep learning unit to the voice image generated by the voice image conversion unit, and then generates a final authentication result using the first authentication result and the second authentication result.

According to another aspect of the present invention, a voice image-based user authentication method is provided.

The method is one by which a voice image-based user authentication apparatus performs user authentication based on a voice image corresponding to a voice input by a user, and includes: generating a voice image corresponding to voice features extracted from the voice input by the user; determining whether a previously generated voice model exists; and, if the voice model does not exist, generating a first authentication result by performing user authentication through comparison of the voice image with previously stored per-user voice images, or, if the voice model exists, generating a first authentication result by performing user authentication through comparison of the generated voice image with the previously stored per-user voice images, generating a second authentication result by performing user authentication through application of the voice model to the generated voice image, and generating a final authentication result using the first authentication result and the second authentication result.

According to the present invention, authentication of a user's voice input is performed based on a voice image corresponding to the features of the voice signal, and is used in parallel with authentication based on a voice model generated by performing deep learning on the voice images according to the number of stored voice images. This enables improved user identification over the prior art through more efficient and accurate voice-based authentication.

In addition, by accurately confirming and verifying user identity through voice image-based authentication, session hijacking can be prevented and user privacy can be protected.

FIG. 1 is a schematic block diagram of a virtual assistant device in which a voice image-based user authentication scheme according to an embodiment of the present invention is used.
FIG. 2 is a schematic block diagram of a voice image-based authentication apparatus according to an embodiment of the present invention.
FIG. 3 is a detailed block diagram of the voice feature extraction unit shown in FIG. 2.
FIG. 4 is a detailed block diagram of the voice image conversion unit shown in FIG. 2.
FIG. 5 is a diagram illustrating the first step of converting a feature vector into an RGB voice image in the voice image-based authentication apparatus according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the second step of converting a feature vector into an RGB voice image in the voice image-based authentication apparatus according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the third step of converting a feature vector into an RGB voice image in the voice image-based authentication apparatus according to an embodiment of the present invention.
FIG. 8 is a detailed block diagram of the adaptive deep learning unit shown in FIG. 2.
FIG. 9 is a detailed block diagram of the identity verification unit shown in FIG. 2.
FIG. 10 is a schematic flowchart of a voice image-based authentication method according to an embodiment of the present invention.
FIG. 11 is a diagram showing a schematic configuration of a voice image-based authentication apparatus according to another embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. To describe the present invention clearly, parts irrelevant to the description are omitted from the drawings, and similar reference numerals are used for similar parts throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "…unit", "…er", and "module" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

The devices described in the present invention are composed of hardware including at least one processor, a memory device, a communication device, and the like, and a program that is executed in combination with the hardware is stored at a designated location. The hardware has the configuration and capability to execute the method of the present invention. The program includes instructions that implement the operating method of the present invention described with reference to the drawings, and carries out the present invention in combination with hardware such as the processor and the memory device.

Hereinafter, a voice image-based user authentication apparatus according to an embodiment of the present invention is described with reference to the drawings.

First, an example of a virtual assistant in which the voice image-based user authentication scheme according to an embodiment of the present invention is used is described.

FIG. 1 is a schematic block diagram of a virtual assistant device in which a voice image-based user authentication scheme according to an embodiment of the present invention is used.

Referring to FIG. 1, the virtual assistant device 10 in which the voice image-based user authentication scheme according to an embodiment of the present invention is used includes an identity management unit 11, a context management unit 12, and a knowledge coordinator 13.

The identity management unit 11 identifies the voice input by the user and verifies and manages the user's identity by applying the voice image-based authentication apparatus 100 to the corresponding voice signal. Here, the voice image-based authentication apparatus 100 converts the voice signal into the frequency domain to generate a corresponding voice vector, generates an RGB image corresponding to the generated voice vector, that is, a voice image, and then verifies the user's identity by performing authentication that compares this voice image with the voice images of previously registered users. The conversion of the voice vector into an RGB image is described later.

The context management unit 12 processes the text corresponding to the user voice identified by the identity management unit 11 into an understandable format and generates an appropriate response to the user's voice input based on the current context.

The knowledge coordinator 13 is a data layer that provides connectivity and data persistence support to the context management unit 12, and consists of several storage devices that store, manage, and maintain various types of data.

Hereinafter, the voice image-based authentication apparatus 100 according to the embodiment of the present invention mentioned above is described in detail.

FIG. 2 is a schematic block diagram of the voice image-based authentication apparatus 100 according to an embodiment of the present invention.

As shown in FIG. 2, the voice image-based authentication apparatus 100 according to an embodiment of the present invention includes a storage unit 110, a voice feature extraction unit 120, a voice image conversion unit 130, an adaptive deep learning unit 140, and an identity verification unit 150.

The storage unit 110 stores voice images corresponding to voice signals input by users during user registration. Each voice image stored at this time is generated by passing the voice input by the user during registration through the voice feature extraction unit 120 and the voice image conversion unit 130. The stored voice image may also be bound to an identifier representing the user, for example the International Mobile Equipment Identity (IMEI) number of the user terminal.

The storage unit 110 also stores the image corresponding to the voice input by a user during a login attempt, together with information on the login attempt.
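A minimal sketch of how a storage unit of this kind might bind voice images to a user identifier is given below. It assumes a simple in-memory dictionary keyed by the identifier; the class name VoiceImageStore, its method names, and the pixel-tuple image type are hypothetical and are not taken from the specification.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical image type: rows of (r, g, b) pixels.
VoiceImage = List[List[Tuple[int, int, int]]]


class VoiceImageStore:
    """Illustrative in-memory stand-in for the storage unit (assumption)."""

    def __init__(self) -> None:
        # Images captured at registration, keyed by user identifier (e.g. an IMEI).
        self._registered: Dict[str, List[VoiceImage]] = defaultdict(list)
        # Images captured at later login attempts, keyed the same way.
        self._logins: Dict[str, List[VoiceImage]] = defaultdict(list)

    def register(self, user_id: str, image: VoiceImage) -> None:
        """Store a registration-time voice image bound to the identifier."""
        self._registered[user_id].append(image)

    def record_login(self, user_id: str, image: VoiceImage) -> None:
        """Store the voice image produced by a login attempt."""
        self._logins[user_id].append(image)

    def images_for(self, user_id: str) -> List[VoiceImage]:
        """Return registration and login images for one user."""
        # .get avoids creating empty entries for unknown users.
        return self._registered.get(user_id, []) + self._logins.get(user_id, [])

    def total_images(self) -> int:
        """Return the number of stored images across all users."""
        registered = sum(len(v) for v in self._registered.values())
        logins = sum(len(v) for v in self._logins.values())
        return registered + logins
```

Keeping registration and login images in separate maps is only one possible layout; it makes it easy to count all stored images when deciding whether enough data has accumulated for training.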

The voice feature extraction unit 120 extracts, from the input user voice signal, a frequency spectrum in the frequency domain for each component of the voice signal.

The voice image conversion unit 130 fuses the per-component frequency spectra of the voice signal extracted by the voice feature extraction unit 120 to generate a corresponding feature vector, and generates an RGB voice image corresponding to the generated feature vector.

When the number of voice images stored in the storage unit 110 reaches a preset threshold, the adaptive deep learning unit 140 performs deep learning using the stored voice images to generate a deep learning voice model. Here, each time the number of voice images stored in the storage unit 110 reaches a multiple of the preset threshold, the adaptive deep learning unit 140 performs deep learning on the stored voice images to generate a fine-tuned deep learning voice model.
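The threshold rule described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the patented implementation: the function name training_action and the string labels it returns are hypothetical, and the sketch only decides when training would be triggered, not how the model is trained.

```python
def training_action(stored_count: int, threshold: int) -> str:
    """Return which training step, if any, the current image count triggers.

    "train"     : the count has just reached the threshold for the first time.
    "fine_tune" : the count has reached a later multiple of the threshold.
    "none"      : no training is triggered at this count.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if stored_count <= 0 or stored_count % threshold != 0:
        return "none"
    if stored_count == threshold:
        return "train"
    return "fine_tune"
```

For example, with a threshold of 100 images, the 100th stored image would trigger initial training and the 200th, 300th, and so on would each trigger fine-tuning.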

The identity verification unit 150 operates in three different modes. In the first mode, during user registration, it binds the voice image, generated by the voice image conversion unit 130 via the voice feature extraction unit 120 from the voice input by the user, to the user's identifier and stores it in the storage unit 110.

In the second mode, when user authentication is being performed and no voice model has yet been generated by the adaptive deep learning unit 140, it performs voice image-based authentication by verifying the user's identity through comparison of the voice image, generated by the voice image conversion unit 130 via the voice feature extraction unit 120 from the voice input by the user at login, with the per-user voice images stored in the storage unit 110 and the voice images collected at previous logins.

In the third mode, when user authentication is being performed and a voice model has been generated by the adaptive deep learning unit 140, it computes and outputs a final authentication result using two authentication results. The first authentication result is the user identity obtained by comparing the voice image, generated by the voice image conversion unit 130 via the voice feature extraction unit 120 from the voice input by the user at login, with the per-user voice images stored in the storage unit 110 and the voice images collected at previous logins. The second authentication result is the result of authenticating the voice image corresponding to the voice input by the user using the voice model generated by the adaptive deep learning unit 140.

Accordingly, in the third mode, if the first authentication result and the second authentication result are the same, the identity verification unit 150 uses that result as the final authentication result. If they differ, it uses one of the two as the final authentication result depending on the situation. For example, the identity verification unit 150 selects between the first and second authentication results according to the total number of voice images stored in the storage unit 110 and the proportion of stored voice images belonging to each user. If the proportions of stored voice images are not balanced across users, the second authentication result may be biased, so in this case the second authentication result may be ignored and the first authentication result used as the final authentication result. Alternatively, if the proportions are balanced across users and the total number of images is sufficient for the deep learning result to be trusted, the second authentication result may be used as the final authentication result. However, even if the proportions are balanced across users, if the total number is not sufficient for the deep learning result to be trusted, the first authentication result may again be used as the final authentication result. Here, the range used to determine whether the proportions of stored voice images are balanced across users may be set in advance.
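The selection rule for the final authentication result can be sketched as follows. The specification states only that a balance range is preset and that the total must be sufficient; the concrete balance test used here (largest per-user share minus smallest per-user share within a tolerance), the function name final_authentication, and the default values of min_total and max_share_spread are assumptions made for illustration.

```python
from typing import Dict


def final_authentication(first: str, second: str,
                         images_per_user: Dict[str, int],
                         min_total: int = 1000,
                         max_share_spread: float = 0.1) -> str:
    """Choose between the image-comparison result (first) and the
    deep learning model result (second)."""
    # Agreement: either result can serve as the final result.
    if first == second:
        return first
    total = sum(images_per_user.values())
    if total == 0:
        return first
    # Assumed balance test: per-user shares must lie within a preset spread.
    shares = [count / total for count in images_per_user.values()]
    balanced = max(shares) - min(shares) <= max_share_spread
    # Trust the model only when data is balanced and sufficiently large.
    if balanced and total >= min_total:
        return second
    return first
```

With two users holding 500 images each and min_total set to 1000, a disagreement would be resolved in favor of the model result; with a 900 to 100 split, or with only a handful of images, the image-comparison result would be kept.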

As described above, according to an embodiment of the present invention, authentication of a user's voice input is performed based on a voice image corresponding to the features of the voice signal, and is used in parallel with authentication based on a voice model generated by performing deep learning on the voice images according to the number of stored voice images. This enables improved user identification over the prior art through more efficient and accurate voice-based authentication.

Hereinafter, the specific configuration of each component of the voice image-based authentication apparatus 100 described above is described.

FIG. 3 is a detailed block diagram of the voice feature extraction unit 120 shown in FIG. 2.

As shown in FIG. 3, the voice feature extraction unit 120 includes a glottal pulse frequency extractor 121, a voice tracking frequency extractor 122, and an environmental noise frequency extractor 123.

In general, a human voice signal is composed of a glottal pulse, vocal tract movement (referred to herein as voice tracking movement), and environmental noise. That is, the components of the voice signal include a glottal pulse component, a voice tracking movement component, and an environmental noise component.

The glottal pulse frequency extractor 121 converts the input user voice signal into a frequency-domain voice signal and then extracts a glottal pulse frequency spectrum from the converted frequency-domain voice signal.

The voice tracking frequency extractor 122 extracts a voice tracking frequency spectrum from the frequency-domain voice signal converted by the glottal pulse frequency extractor 121.

The environmental noise frequency extractor 123 extracts an environmental noise frequency spectrum from the frequency-domain voice signal converted by the glottal pulse frequency extractor 121.

FIG. 4 is a detailed block diagram of the voice image conversion unit 130 shown in FIG. 2.

As shown in FIG. 4, the voice image conversion unit 130 includes a feature fuser 131, a feature vector generator 132, a feature matrix manager 133, and an image generator 134.

For the glottal pulse frequency spectrum, the voice tracking frequency spectrum, and the environmental noise frequency spectrum extracted by the voice feature extraction unit 120, the feature fuser 131 fuses the frequency spectra of various magnitudes of each of the three spectra at a given time t.

The feature vector generator 132 generates, from the frequency spectra fused by the feature fuser 131, a feature vector whose feature dimensions are aligned through removal and padding.

When the feature vector generated by the feature vector generator 132 is incomplete, the feature matrix manager 133 fills it with values appropriate to the desired feature dimension so that it becomes a complete feature vector.
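Aligning feature dimensions through removal and padding can be illustrated with a short helper. The specification does not say which values are removed or what padding value is used, so this sketch assumes that excess trailing values are dropped and that missing values are filled with a constant; the function name align_feature_vector and the pad_value parameter are hypothetical.

```python
from typing import List, Sequence


def align_feature_vector(values: Sequence[float], dim: int,
                         pad_value: float = 0.0) -> List[float]:
    """Return a vector of exactly `dim` values.

    Assumption: values beyond `dim` are removed from the end, and a
    short vector is padded at the end with `pad_value`.
    """
    if dim < 0:
        raise ValueError("dim must be non-negative")
    aligned = list(values[:dim])                       # removal of excess values
    aligned.extend([pad_value] * (dim - len(aligned)))  # padding of missing values
    return aligned
```

Applying the same target dimension to the fused features of every time step yields columns of equal length, which is what allows them to be stacked into the feature matrix used for image generation.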

The image generator 134 generates an RGB image corresponding to the feature vector delivered from the feature matrix manager 133.

Hereinafter, the details of how the image generator 134 generates the corresponding RGB image from the feature vector delivered from the feature matrix manager 133 are described.

The feature vector delivered from the feature matrix manager 133 can be converted into the corresponding RGB image, that is, the voice image, in three steps.

In the first step, the features contained in the feature vector are selected within a predefined window size. Because the window must be divided into three sub-blocks corresponding to the three RGB colors, a window size that is a multiple of 9 is considered.

For example, as shown in FIG. 5, the feature vector delivered from the feature matrix manager 133 is organized as a matrix in which the x-axis represents time and the y-axis represents the voice features at a given time, where the voice features at a given time consist of n values.

FIG. 5 shows an example of such a feature vector with a window size of 9. If a block spanning times 1 to 9 on the time axis (that is, a block of nine columns) is selected for the first image, then for the second image the window moves on to the next block, spanning times 10 to 18, and each subsequent image is generated from the following window block.
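The window selection of the first step can be sketched as an iterator over non-overlapping column blocks. The sketch assumes the feature matrix is stored as a list of rows indexed matrix[feature_index][time_index], and that trailing columns that do not fill a whole window are simply skipped; the name iter_windows and both of these choices are assumptions made for illustration.

```python
from typing import Iterator, List

# Assumed layout: matrix[feature_index][time_index]
Matrix = List[List[float]]


def iter_windows(matrix: Matrix, window: int = 9) -> Iterator[Matrix]:
    """Yield consecutive, non-overlapping blocks of `window` time columns."""
    if window <= 0:
        raise ValueError("window must be positive")
    n_time = len(matrix[0]) if matrix else 0
    # Step by a full window so each block starts where the previous one ended.
    for start in range(0, n_time - window + 1, window):
        yield [row[start:start + window] for row in matrix]
```

With a window of 9, the first block covers time columns 1 to 9 and the second covers 10 to 18 (indices 0 to 8 and 9 to 17 in zero-based terms), with each block becoming one voice image.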

In the second step, the features selected in the first step are multiplied by '255', the code value of the 'R (red)' color, to convert them into an image matrix. The nine-column block that makes up the image matrix is then further divided into three sub-blocks of three columns each.

For convenience of explanation, the example shown in FIG. 6 assumes nine voice features.

Referring to FIG. 6, each feature in the block of window size 9 selected in the first step, that is, the block of nine columns, is multiplied by '255', the code value of the 'R' color, to form an image matrix in which each feature has a value of '0' or '255'. The image matrix is then further divided into three sub-blocks of three columns each: block 1, block 2, and block 3. The division into three blocks lets each block correspond to one channel of 'RGB'. For example, block 1 corresponds to the 'R' block, block 2 to the 'G' block, and block 3 to the 'B' block.

Finally, in the third step, the color codes of block 1, block 2, and block 3 serve as the (r, g, b) indices, respectively, and the corresponding color is generated. In other words, the color codes of the three blocks are combined to form a single color block.

Referring to FIG. 7, among the three blocks on the left, the (1, 1) feature of block 1, that is, the first feature at time 1, is '255' and serves as the 'r' index of (r, g, b); the (4, 1) feature of block 2, that is, the first feature at time 4, is '255' and serves as the 'g' index; and the (7, 1) feature of block 3, that is, the first feature at time 7, is '0' and serves as the 'b' index. The features at corresponding positions in the three blocks thus produce one color code, for example (255, 255, 0), which is the color code for yellow.

Accordingly, by mapping the three blocks, block 1, block 2, and block 3, to the R block, G block, and B block, respectively, and using the color codes they form, an RGB voice image such as the one on the right side of FIG. 7 can be generated.

The above describes generating an RGB voice image for a block corresponding to a window size of 9 in the feature vector. Those skilled in the art will readily understand that a corresponding RGB voice image can be generated for the entire feature vector by applying the same procedure while moving the window of size 9 along the time axis.
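The three steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `window_to_rgb` and `feature_matrix_to_images` are hypothetical, the features are assumed to be binary (0 or 1) so that multiplying by 255 yields 0 or 255 as in FIG. 6, and a trailing window shorter than the window size is simply dropped here, although the specification assigns padding to the feature matrix manager 133.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def window_to_rgb(window: List[List[int]]) -> List[List[Pixel]]:
    """Convert one window (rows = features, columns = time steps) to RGB pixels.

    Assumes binary feature values and a column count divisible by 3.
    """
    n_cols = len(window[0])
    if n_cols % 3 != 0:
        raise ValueError("window width must be divisible by 3")
    width = n_cols // 3
    image = []
    for row in window:
        scaled = [value * 255 for value in row]      # step 2: scale to 0/255
        r_block = scaled[:width]                     # block 1 -> R
        g_block = scaled[width:2 * width]            # block 2 -> G
        b_block = scaled[2 * width:]                 # block 3 -> B
        # Step 3: same position in each block forms one (r, g, b) code.
        image.append(list(zip(r_block, g_block, b_block)))
    return image


def feature_matrix_to_images(matrix: List[List[int]],
                             window_size: int = 9) -> List[List[List[Pixel]]]:
    """Step 1: slide a non-overlapping window along the time axis."""
    n_cols = len(matrix[0])
    images = []
    for start in range(0, n_cols - window_size + 1, window_size):
        window = [row[start:start + window_size] for row in matrix]
        images.append(window_to_rgb(window))
    return images


# First feature row from the FIG. 7 example: times 1 and 4 are set, time 7 is not.
example_row = [1, 0, 0, 1, 0, 0, 0, 0, 0]
print(window_to_rgb([example_row])[0][0])
```

With the example row, the first pixel combines the values at times 1, 4, and 7, which reproduces the yellow color code described for FIG. 7.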

Next, FIG. 8 is a detailed block diagram of the adaptive deep learning performing unit 140 shown in FIG. 2.

As shown in FIG. 8, the adaptive deep learning performing unit 140 includes a training data manager 141, a data preprocessor 142, and a deep learning model trainer 143.

The training data manager 141 monitors the total number of voice images stored in the storage unit 110, that is, the voice images of each registered user together with the voice images collected at login, and loads the voice images each time the monitored total reaches a preset threshold, namely the image threshold.

The data preprocessor 142 performs preprocessing on the voice images loaded by the training data manager 141, such as segmentation and setting and checking the image size.

The deep learning model trainer 143 performs deep learning, a well-known machine learning technique, on the voice images preprocessed by the data preprocessor 142 to generate a corresponding voice model 145. The generated voice model 145 is stored in the model storage 144, which may be included in the storage unit 110.

As described above, each time the number of voice images stored in the storage unit 110 reaches the preset threshold, the training data manager 141 loads the voice images and the data preprocessor 142 preprocesses them. The deep learning model trainer 143 then performs deep learning on those voice images to generate a fine-tuned deep learning voice model.

For example, with a preset threshold of 50, the training data manager 141 detects when the number of voice images stored in the storage unit 110 reaches 50, and deep learning is performed on the 50 voice images to generate a corresponding voice model. When new voice inputs later bring the number of voice images to 100, the training data manager 141 detects this as well, and deep learning is performed on the 100 voice images to produce a voice model fine-tuned from the existing one. Deep learning thus continues each time another 50 images, the threshold amount, have been stored. In other words, deep learning is performed on the stored voice images whenever their number becomes a multiple of the threshold of 50.
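The retraining trigger in this example can be sketched as follows. This is only an illustration under stated assumptions: the class name `TrainingDataManager`, its `add_image` method, and the `training_runs` log are hypothetical, and the actual deep learning step is replaced by recording the size of the training set.

```python
from typing import Any, List


class TrainingDataManager:
    """Counts stored voice images and signals when retraining is due."""

    def __init__(self, threshold: int = 50) -> None:
        if threshold <= 0:
            raise ValueError("threshold must be positive")
        self.threshold = threshold
        self.images: List[Any] = []
        self.training_runs: List[int] = []  # training-set size at each trigger

    def add_image(self, image: Any) -> bool:
        """Store an image; return True when the count hits a threshold multiple."""
        self.images.append(image)
        if len(self.images) % self.threshold == 0:
            # Placeholder for loading, preprocessing, and fine-tuning the model.
            self.training_runs.append(len(self.images))
            return True
        return False


manager = TrainingDataManager(threshold=50)
for index in range(120):
    manager.add_image(f"voice-image-{index}")
print(manager.training_runs)
```

Storing 120 images with a threshold of 50 triggers training twice, at 50 and at 100 images, matching the example in the text.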

Next, FIG. 9 is a detailed block diagram of the identity verification unit 150 shown in FIG. 2.

As shown in FIG. 9, the identity verification unit 150 includes a determiner 151, an identity generator 152, a user authenticator 153, a model authenticator 154, and an authentication result generator 155.

The determiner 151 determines whether the voice image generated by the voice image conversion unit 130 is a voice image for user registration. If it is, the determiner 151 passes the voice image generated by the voice image conversion unit 130 to the identity generator 152.

However, if the determiner 151 determines that the voice image generated by the voice image conversion unit 130 is a voice image for user login rather than for user registration, it determines whether a voice model 145 is stored in the storage unit 110, specifically in the model storage 144.

If it is determined that no voice model 145 is stored in the model storage 144, the voice image generated by the voice image conversion unit 130 is passed only to the user authenticator 153.

However, if it is determined that a voice model 145 is stored in the model storage 144, the voice image generated by the voice image conversion unit 130 is passed to both the user authenticator 153 and the model authenticator 154.
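The routing performed by the determiner 151 amounts to a small decision table, sketched below. The names `Purpose` and `route_voice_image` and the string labels for the destinations are hypothetical and serve only to make the three cases explicit.

```python
from enum import Enum
from typing import List


class Purpose(Enum):
    REGISTRATION = "registration"
    LOGIN = "login"


def route_voice_image(purpose: Purpose, model_stored: bool) -> List[str]:
    """Return the components that should receive the voice image."""
    if purpose is Purpose.REGISTRATION:
        return ["identity_generator"]            # registration path
    if model_stored:
        # Login with a trained voice model: both authenticators run.
        return ["user_authenticator", "model_authenticator"]
    return ["user_authenticator"]                # login without a model


print(route_voice_image(Purpose.LOGIN, model_stored=False))
```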

The identity generator 152 operates during user registration. It binds the voice image received from the determiner 151 to an identifier of the user terminal, for example an IMEI number, and stores it in the storage unit 110.

The user authenticator 153 operates during user authentication, that is, at user login. It determines the similarity between the voice image received from the determiner 151 and the voice images stored in the storage unit 110, namely the voice images of each registered user and the voice images used at previous logins. If a determined similarity is equal to or greater than a preset threshold, it authenticates the input as the user with the highest similarity among those computed, and passes the authentication result (the first authentication result) to the authentication result generator 155. For example, suppose voice images for three users A, B, and C are stored in the storage unit 110 and the preset threshold is 50. If the similarity determination for the voice image generated from the voice signal input for login yields a probability of 80% for A, 55% for B, and 20% for C, then of users A and B, whose similarity is at or above the threshold of 50, user A has the highest similarity and is output as the voice image-based authentication result. When multiple voice images are stored for a user, the average of the similarities calculated for each of those voice images is used as that user's similarity.
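The selection rule of the user authenticator 153 can be sketched as below. The specification does not say how the similarity between two voice images is computed, so the sketch takes per-image similarity scores as given input. The function name `authenticate_by_similarity` is hypothetical, and returning `None` when no user reaches the threshold is an assumption, since the text does not describe that case.

```python
from typing import Dict, List, Optional


def authenticate_by_similarity(scores: Dict[str, List[float]],
                               threshold: float = 50.0) -> Optional[str]:
    """Pick the user whose average similarity is highest and meets the threshold.

    scores maps a user identifier to the similarities (in percent) between
    the login voice image and each stored voice image of that user.
    """
    # Average the per-image similarities for each user.
    per_user = {user: sum(values) / len(values)
                for user, values in scores.items() if values}
    # Keep only users at or above the preset threshold.
    candidates = {user: value for user, value in per_user.items()
                  if value >= threshold}
    if not candidates:
        return None  # assumed behavior: no user is authenticated
    return max(candidates, key=lambda user: candidates[user])


# Values from the example: A averages 80, B is 55, C is 20.
print(authenticate_by_similarity({"A": [90.0, 70.0], "B": [55.0], "C": [20.0]}))
```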

In addition, once the user authenticator 153 has authenticated the voice image received from the determiner 151, it stores that voice image in the storage unit 110 in association with the authenticated user. The voice image is bound to the authenticated user's identifier when stored.

The model authenticator 154 operates during user authentication, that is, at user login, provided that a voice model has been generated and stored in the model storage 144. It performs user authentication by applying the voice model stored in the model storage 144 to the voice image received from the determiner 151, and then passes the authentication result (the second authentication result) to the authentication result generator 155.

The authentication result generator 155 receives the first authentication result generated by the user authenticator 153 and the second authentication result generated by the model authenticator 154, and generates and outputs a final authentication result. If no voice model is stored in the model storage 144 and therefore no second authentication result is received from the model authenticator 154, it outputs the first authentication result received from the user authenticator 153 as the final authentication result. If a voice model is stored in the model storage 144, so that the model authenticator 154 performs authentication and a second authentication result is received, it outputs either the first authentication result of the user authenticator 153 or the second authentication result of the model authenticator 154 as the final authentication result. In this case, the authentication result generator 155 chooses between the first and second authentication results according to the total number of voice images stored in the storage unit 110 and the proportion of voice images per user stored in the storage unit 110. For example, as described above, if the proportion of voice images stored in the storage unit 110 is not balanced across users, the authentication result generator 155 outputs the first authentication result as the final authentication result. If the proportion is balanced across users and the total number is sufficient to trust the deep learning result, the authentication result generator 155 may output the second authentication result as the final authentication result. Whether the total number is sufficient to trust the deep learning result may be determined by a preset threshold number. If the proportion is balanced across users but the total number is not sufficient to trust the deep learning result, the authentication result generator 155 may output the first authentication result as the final authentication result. The range used to determine whether the proportion of voice images per user is balanced may also be preset.
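The choice between the two authentication results can be sketched as follows. The specification only says that the balance range and the threshold number are preset, so the concrete balance test used here (the largest and smallest per-user shares differ by at most `tolerance`) is an assumption, as are the names `is_balanced` and `select_final_result`.

```python
from typing import Dict, Optional


def is_balanced(counts: Dict[str, int], tolerance: float) -> bool:
    """Assumed balance test: per-user shares differ by at most `tolerance`."""
    total = sum(counts.values())
    if total == 0:
        return False
    shares = [count / total for count in counts.values()]
    return max(shares) - min(shares) <= tolerance


def select_final_result(first: str,
                        second: Optional[str],
                        counts: Dict[str, int],
                        min_total: int,
                        tolerance: float = 0.1) -> str:
    """Return the first or second authentication result as the final result."""
    if second is None:
        return first                     # no voice model stored yet
    enough_images = sum(counts.values()) >= min_total
    if is_balanced(counts, tolerance) and enough_images:
        return second                    # deep learning result is trusted
    return first                         # fall back to image comparison


print(select_final_result("A", "B", {"A": 50, "B": 50}, min_total=100))
```

The fallback to the first result in every other case mirrors the text: the model-based result is used only when both the balance condition and the count condition hold.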

Hereinafter, a voice image-based authentication method according to an embodiment of the present invention will be described with reference to the drawings.

FIG. 10 is a schematic flowchart of a voice image-based authentication method according to an embodiment of the present invention.

Referring to FIG. 10, a voice input is received from a user (S100).

A feature vector is generated for the input voice (S110), and a voice image corresponding to the generated feature vector is generated (S120).

It is then determined whether the user's voice input is for user registration (S130). If it is, the generated voice image is bound to the user identifier and stored in the storage unit 110 (S140).

However, if it is determined in step S130 that the user's voice input is for user login, it is determined whether a voice model is stored in the storage unit 110 (S150). That is, it is determined whether the number of voice images stored in the storage unit 110 has already reached a multiple of the preset threshold, so that deep learning has been performed on the stored voice images and a corresponding voice model has been generated.

If no voice model is stored in the storage unit 110, user authentication is performed by comparing the generated voice image with the voice images stored in the storage unit 110, and the authentication result is output (S160).

However, if a voice model is stored in the storage unit 110, a first authentication result is generated by comparing the generated voice image with the voice images stored in the storage unit 110, and a second authentication result is generated by applying the voice model stored in the storage unit 110 to the generated voice image (S170).

It is then determined whether the proportion of voice images per user stored in the storage unit 110 is balanced across users (S180).

If the proportion of voice images per user stored in the storage unit 110 is determined to be balanced across users, it is determined whether the total number of voice images stored in the storage unit 110 is equal to or greater than a preset threshold number and therefore sufficient to trust the deep learning result (S190).

If the total number of voice images stored in the storage unit 110 is determined to be equal to or greater than the preset threshold number and therefore sufficient to trust the deep learning result, the second authentication result is output as the final authentication result (S200).

However, if it is determined in step S180 that the proportion of voice images per user stored in the storage unit 110 is not balanced across users, or if it is determined in step S190 that the total number of voice images stored in the storage unit 110 is less than the preset threshold number and therefore not sufficient to trust the deep learning result, the first authentication result is output as the final authentication result (S210).

Hereinafter, a voice image-based authentication apparatus according to another embodiment of the present invention will be described.

FIG. 11 is a diagram showing a schematic configuration of a voice image-based authentication apparatus 200 according to another embodiment of the present invention.

As shown in FIG. 11, the voice image-based authentication apparatus 200 according to another embodiment of the present invention includes at least one processor 210, a memory 220, a communicator 230, an input/output device 240, and a communication bus 250.

The processor 210 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling program execution in the solution of the present application.

The memory 220 stores information related to the voice image-based authentication apparatus according to an embodiment of the present invention.

Specifically, the memory 220 is further configured to store a set of codes that control the processor 210 to execute the following processes: a process of receiving a voice input from a user through the communicator 230 or the input/output device 240; a process of generating a voice image corresponding to the input voice; a process of binding the voice image to the user's identifier and storing it in the memory 220; a process of generating a first authentication result for the user by comparing the generated voice image with the stored voice images; a process of generating a corresponding voice model by performing deep learning on the voice images stored in the memory 220; a process of generating a second authentication result for the user by applying the voice model to the generated voice image; and a process of generating a final authentication result using the first authentication result and the second authentication result.

In addition, when the number of voice images stored in the memory 220 reaches a multiple of a preset threshold, the processor 210 further executes a process of generating a corresponding voice model by performing deep learning on the voice images stored in the memory 220.

In addition, the processor 210 further executes a process of determining whether the proportion of voice images per user stored in the memory 220 is balanced across users, and a process of determining whether the number of voice images stored in the memory 220 is equal to or greater than a preset threshold number. The process of generating the final authentication result using the first authentication result and the second authentication result includes: a process of outputting the first authentication result as the final authentication result when the proportion of voice images per user stored in the memory 220 is not balanced across users, or when the number of voice images stored in the memory 220 is less than the preset threshold number; and a process of outputting the second authentication result as the final authentication result when the proportion of voice images per user stored in the memory 220 is balanced across users and the number of voice images stored in the memory 220 is equal to or greater than the preset threshold number.

The memory 220 may be a read-only memory (ROM) or another type of static storage device capable of storing instructions; a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other compact disc or optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or other magnetic storage device; or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation. The memory 220 may exist independently and is connected to the processor 210 by the communication bus 250.

The communicator 230 communicates with other devices or communication networks and may be implemented with various communication technologies. Applicable technologies include Wi-Fi, WCDMA (Wideband CDMA), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), HSPA (High Speed Packet Access), Mobile WiMAX, WiBro, LTE (Long Term Evolution), Bluetooth, infrared communication (IrDA, Infrared Data Association), NFC (Near Field Communication), Zigbee, wireless LAN technology, and USB (Universal Serial Bus). When a service is provided over an Internet connection, the communicator may follow TCP/IP, the standard protocol for transmitting information on the Internet.

The input/output device 240 consists of an input device 241 and an output device 242. The input device 241 communicates with the processor 210 and can receive user input in a plurality of ways. For example, the input device 241 may be a mouse, a keyboard, a touch screen, or a sensing device. The output device 242 communicates with the processor 210 and can display information or output sound in a plurality of ways. For example, the output device 242 may be a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a speaker, or the like.

The communication bus 250 is configured to couple all components of the voice image-based authentication apparatus 200, namely the processor 210, the memory 220, the communicator 230, and the input/output device 240.

The embodiments of the present invention described above are not implemented only through an apparatus and a method; they may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments, or through a recording medium on which such a program is recorded.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

A voice image-based user authentication device comprising:
a voice feature extraction unit for extracting voice features from a voice input from a user;
a voice image converter for generating a voice image corresponding to the voice features extracted by the voice feature extraction unit;
a storage unit for storing the voice image generated by the voice image converter;
an adaptive deep learning performing unit for generating a corresponding voice model by performing deep learning on the voice images stored in the storage unit; and
an identity verification unit that generates a first authentication result by performing user authentication by comparing the voice image generated by the voice image converter with the per-user voice images stored in the storage unit, generates a second authentication result by performing user authentication by applying the voice model generated by the adaptive deep learning performing unit to the voice image generated by the voice image converter, and generates a final authentication result using the first authentication result and the second authentication result,
wherein the voice feature extraction unit converts the voice input from the user into a frequency-domain voice signal and then extracts a glottal pulse frequency spectrum, a voice tracking frequency spectrum, and an environmental noise tracking frequency spectrum, respectively, and
wherein the voice image converter comprises:
a feature fuser that fuses the glottal pulse frequency spectrum, the voice tracking frequency spectrum, and the environmental noise tracking frequency spectrum extracted by the voice feature extraction unit;
a feature vector generator that generates a feature vector in which features are aligned with respect to the frequency spectrum fused by the feature fuser;
a feature matrix manager that, when the feature vector generated by the feature vector generator is an incomplete vector, generates a complete feature vector by filling the feature vector with values for a desired feature dimension; and
an image generator that generates, from the feature vector generated by the feature matrix manager, a voice image that is a corresponding RGB (Red, Green, Blue) image.
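As an illustration of the feature matrix manager recited in the claim above, the following minimal sketch pads an incomplete feature vector up to a desired feature dimension. The function name `complete_feature_vector`, the fill value of 0.0, and the choice to leave an already complete vector unchanged are assumptions made for illustration; the claim only states that the vector is filled with values for the desired feature dimension.

```python
from typing import List, Sequence


def complete_feature_vector(
    features: Sequence[float],
    target_dim: int,
    fill_value: float = 0.0,  # assumed fill value; not specified in the claims
) -> List[float]:
    """Return a copy of `features` padded with `fill_value` up to `target_dim`."""
    vec = list(features)
    missing = target_dim - len(vec)
    if missing > 0:
        # The vector is "incomplete": append fill values for the missing dimensions.
        vec.extend([fill_value] * missing)
    return vec
```

A vector that already has at least `target_dim` entries is returned unchanged in this sketch.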

delete

The device of claim 1,
wherein the adaptive deep learning performing unit comprises:
a training data manager that loads the voice images stored in the storage unit whenever the number of voice images stored in the storage unit reaches a multiple of a preset first threshold value;
a data preprocessor that performs preprocessing tasks, including segmentation and image size setting, on the voice images loaded by the training data manager; and
a deep learning model trainer that generates a corresponding voice model by performing deep learning on the voice images preprocessed by the data preprocessor.

5. The device of claim 4,
wherein the deep learning model trainer generates a fine-tuned voice model by performing deep learning on the voice images whenever the number of voice images stored in the storage unit reaches a multiple of the preset first threshold value.
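The training trigger recited above ("whenever the number of stored voice images reaches a multiple of a preset first threshold value") can be sketched as a simple check. The name `should_train` and the decision to treat an empty store as not triggering are assumptions for illustration, not part of the claims.

```python
def should_train(num_stored_images: int, first_threshold: int) -> bool:
    """Return True when the stored image count is a positive multiple of the threshold."""
    if first_threshold <= 0:
        raise ValueError("first_threshold must be a positive integer")
    # Assumption: zero stored images does not trigger training.
    return num_stored_images > 0 and num_stored_images % first_threshold == 0
```

With a threshold of 50, for example, this check would fire at 50, 100, 150 stored images and so on, which corresponds to the repeated fine-tuning described in the claim.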

The device of claim 1,
wherein the identity verification unit comprises:
a determiner that determines whether the voice image generated by the voice image converter is a voice image for user registration or a voice image for user login, and whether a voice model generated by the adaptive deep learning performing unit is stored in the storage unit;
an identity generator that, when the determiner determines that the voice image generated by the voice image converter is a voice image for user registration, binds the voice image to an identifier of the user and stores it in the storage unit;
a user authenticator that, when the determiner determines that the voice image generated by the voice image converter is a voice image for user login, generates the first authentication result by performing user authentication by comparing the voice image generated by the voice image converter with the per-user voice images stored in the storage unit;
a model authenticator that, when the determiner determines that the voice image generated by the voice image converter is a voice image for user login and that the voice model generated by the adaptive deep learning performing unit is stored in the storage unit, generates the second authentication result by performing user authentication by applying the voice model stored in the storage unit to the voice image generated by the voice image converter; and
an authentication result generator that generates the first authentication result as the final authentication result when the second authentication result is not generated by the model authenticator, and generates one of the first authentication result and the second authentication result as the final authentication result when the second authentication result is generated by the model authenticator.

7. The device of claim 6,
wherein the authentication result generator generates one of the first authentication result and the second authentication result as the final authentication result according to the total number of voice images stored in the storage unit and the ratio of voice images for each user stored in the storage unit.

8. The device of claim 7,
wherein the authentication result generator:
generates the second authentication result as the final authentication result when the ratio of voice images stored in the storage unit is equal for each user and the total number of voice images stored in the storage unit is equal to or greater than a preset threshold number; and
generates the first authentication result as the final authentication result when the ratio of voice images stored in the storage unit is not equal for each user or the total number of voice images stored in the storage unit is less than the preset threshold number.
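The selection rule recited above can be sketched as follows. This is only an illustrative reading: the function name `select_final_result`, the representation of authentication results as booleans, and the interpretation of "the ratio is equal for each user" as every user having the same number of stored voice images are all assumptions.

```python
from typing import Mapping, Optional


def select_final_result(
    first_result: bool,
    second_result: Optional[bool],
    images_per_user: Mapping[str, int],
    threshold_count: int,
) -> bool:
    """Pick the final authentication result from the first and second results."""
    if second_result is None:
        # No model-based result was generated: fall back to the image comparison.
        return first_result
    counts = list(images_per_user.values())
    total = sum(counts)
    # Assumed meaning of "equal ratio": every user has the same image count.
    balanced = len(set(counts)) <= 1
    if balanced and total >= threshold_count:
        return second_result
    return first_result
```

Under this reading, the model-based result is trusted only when the stored data is both balanced across users and large enough; otherwise the direct image comparison decides.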

9. The device of claim 7,
wherein, when the first authentication result is generated, the user authenticator stores the voice image generated by the voice image converter in the storage unit as a user voice image input during a login attempt, and
wherein the total number of voice images stored in the storage unit, as used by the authentication result generator, includes the user voice images input during login attempts.

The device of claim 1,
wherein the voice image converter generates the voice features extracted by the voice feature extraction unit as a corresponding RGB voice image by:
selecting a feature block having a predefined window size from a feature vector corresponding to the voice features extracted by the voice feature extraction unit;
converting the feature block of the predefined window size into an image matrix by multiplying each feature in the feature block by 255, which is the code value of 'R (Red)', and then dividing the converted image matrix into three sub-blocks; and
setting the three divided sub-blocks as an R (Red) block, a G (Green) block, and a B (Blue) block, respectively, and then setting the color of each mutually corresponding position in the three blocks to the color corresponding to the color code (r, g, b).
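The conversion recited above can be sketched in a few lines. The sketch assumes that features are already normalized to the range [0, 1], that the window size is divisible by three, and that the three sub-blocks are consecutive thirds of the scaled block; the function name `feature_block_to_rgb` and the clamping and rounding of scaled values are likewise assumptions, since the claim does not specify them.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def feature_block_to_rgb(
    features: Sequence[float], window_size: int, start: int = 0
) -> List[Pixel]:
    """Convert one feature block into a list of (r, g, b) pixels."""
    if window_size <= 0 or window_size % 3 != 0:
        raise ValueError("window_size must be a positive multiple of 3")
    block = list(features[start:start + window_size])
    if len(block) != window_size:
        raise ValueError("feature vector is too short for the requested block")
    # Multiply each feature by 255 and clamp to the valid 8-bit range (assumed).
    scaled = [min(255, max(0, round(f * 255))) for f in block]
    n = window_size // 3
    r_block, g_block, b_block = scaled[:n], scaled[n:2 * n], scaled[2 * n:]
    # Values at the same position in the R, G and B blocks form one pixel.
    return list(zip(r_block, g_block, b_block))
```

For example, `feature_block_to_rgb([0.0, 1.0, 0.2, 0.0, 1.0, 0.0], 6)` yields two pixels, `(0, 51, 255)` and `(255, 0, 0)`. Arranging the resulting pixels into a two-dimensional image is left out of the sketch because the claim does not state the image layout.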

A method by which a voice image-based user authentication device performs user authentication based on a voice image corresponding to a voice input from a user, the method comprising:
generating a voice image corresponding to voice features extracted from the voice input from the user;
determining whether there is a pre-generated voice model; and
when there is no voice model, generating a first authentication result by performing user authentication by comparing the generated voice image with previously stored per-user voice images, or
when there is the voice model, generating a first authentication result by performing user authentication by comparing the generated voice image with the previously stored per-user voice images, generating a second authentication result by performing user authentication by applying the voice model to the generated voice image, and then generating a final authentication result using the first authentication result and the second authentication result,
wherein the step of generating the voice image comprises:
converting the voice input from the user into a frequency-domain voice signal and then extracting a glottal pulse frequency spectrum, a voice tracking frequency spectrum, and an environmental noise tracking frequency spectrum, respectively;
fusing the extracted glottal pulse frequency spectrum, voice tracking frequency spectrum, and environmental noise tracking frequency spectrum;
generating a feature vector in which features are aligned with respect to the fused frequency spectrum;
when the generated feature vector is an incomplete vector, generating a complete feature vector by filling the feature vector with values for a desired feature dimension; and
generating, from the complete feature vector, a voice image that is a corresponding RGB (Red, Green, Blue) image.
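The control flow of the method above (image comparison alone when no voice model exists, both checks otherwise) can be sketched as below. All names are hypothetical: `compare_with_stored` stands in for the comparison against the stored per-user voice images, `voice_model` for the trained model, and `combine` for whatever rule produces the final result from the two results, which this claim leaves open.

```python
from typing import Callable, Optional


def authenticate(
    voice_image: object,
    compare_with_stored: Callable[[object], bool],
    voice_model: Optional[Callable[[object], bool]],
    combine: Callable[[bool, bool], bool],
) -> bool:
    """Return the authentication outcome for one login voice image."""
    # The first authentication result is generated in both branches.
    first_result = compare_with_stored(voice_image)
    if voice_model is None:
        # No pre-generated voice model: only the first result is available.
        return first_result
    second_result = voice_model(voice_image)
    return combine(first_result, second_result)
```

Passing the combination rule in as a callable keeps the sketch neutral about how the final result is chosen.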

12. The method of claim 11,
further comprising, between the step of generating the voice image and the step of determining whether there is a voice model:
storing the generated voice image in a storage device; and
whenever the number of voice images stored in the storage device reaches a multiple of a preset first threshold value, generating a corresponding voice model by performing deep learning on the voice images stored in the storage device.

13. The method of claim 12,
wherein the step of storing the voice image in the storage device is performed when the voice input from the user is an input for user registration, and
wherein the step of determining whether there is a pre-generated voice model is performed when the voice input from the user is an input for user login.

14. The method of claim 12,
wherein the step of generating the final authentication result comprises:
generating the second authentication result as the final authentication result when the ratio of voice images stored in the storage device is equal for each user and the total number of voice images stored in the storage device is equal to or greater than a preset threshold number; or
generating the first authentication result as the final authentication result when the ratio of voice images stored in the storage device is not equal for each user or the total number of voice images stored in the storage device is less than the preset threshold number.

15. The method of claim 14,
wherein the step of generating the first authentication result comprises storing the generated voice image in the storage device as a user voice image input during a login attempt, and
wherein the total number of voice images stored in the storage device includes the user voice images input during login attempts.