KR20220045753A - Method for detecting face using voice

Info

Publication number: KR20220045753A
Application number: KR1020200128775A
Authority: KR
Inventors: 이동열
Original assignee: 주식회사 카카오뱅크
Priority date: 2020-10-06
Filing date: 2020-10-06
Publication date: 2022-04-13
Also published as: WO2022075702A1; KR20230104582A; US20230377367A1; KR102586075B1

Abstract

Disclosed is a face detection method using sound. A face detection method includes the steps of: receiving video data and sound data from a user terminal; deriving a first interval related to a predetermined message on the basis of the received sound data; configuring a second interval on the basis of the derived first interval; extracting a part of the video data corresponding to the second interval; deriving a video frame which satisfies a predetermined criterion, from the extracted video data; and detecting a facial image included in the derived video frame. Accordingly, it is possible to reduce time and resources required for face detection.

Description

Method for detecting face using voice

The present invention relates to a face detection method using voice. Specifically, the present invention relates to a method of deriving a section related to a predetermined message on the basis of received voice data, and detecting a facial image in an image frame of the image data extracted on the basis of the derived section.

The content described in this section merely provides background information on the present embodiment and does not constitute prior art.

Recently, with the development of smart devices, networks, and various network services, many tasks that were conventionally performed face to face, including banking, have shifted to non-face-to-face processing over online and wireless channels. When identity authentication of a user is required during such non-face-to-face processing, a face detection method that extracts the user's face from real-time video of the user and compares it with a pre-registered photograph of the user is widely used.

A conventional face detection method decodes the entire recorded video and searches all frames of the decoded video for a specific frame in which an optimal face pose is present, and therefore requires considerable time and resources for face detection.

In addition, another conventional face detection method extracts all frames of the recorded video and runs a face detection algorithm on every extracted frame, so the resources used for face detection increase sharply.

Therefore, there has been a need for a face detection method that achieves the same effect using less time and fewer resources.

An object of the present invention is to provide a method of deriving a section related to a predetermined message using voice data converted into the frequency domain, deriving an image frame that satisfies a predetermined criterion from the image data corresponding to the derived section, and detecting a facial image in the derived image frame.

Another object of the present invention is to provide a method of deriving the section of voice data most relevant to a predetermined message using a pre-trained deep learning module, deriving an image frame that satisfies a predetermined criterion from the image data corresponding to the derived section, and detecting a facial image in the derived image frame.

The objects of the present invention are not limited to those mentioned above. Other objects and advantages of the present invention that are not mentioned can be understood from the following description and will be understood more clearly from the embodiments of the present invention. It will also be readily apparent that the objects and advantages of the present invention can be realized by the means indicated in the claims and combinations thereof.

A face detection method according to an embodiment of the present invention is performed in a server associated with a user terminal and includes: receiving image data and voice data from the user terminal; deriving a first section related to a predetermined message on the basis of the received voice data; setting a second section on the basis of the derived first section; extracting a part of the image data corresponding to the second section; deriving, from the extracted image data, an image frame that satisfies a predetermined criterion; and detecting a facial image included in the derived image frame.

In addition, deriving the first section may include: generating a spectrogram by converting the voice data into the frequency domain for each predetermined time unit; generating a frequency pattern of voice data that includes the predetermined message; and selecting, as the first section, the section of the spectrogram having the highest similarity to the frequency pattern.

In addition, generating the spectrogram includes: generating a first spectrum by converting, into the frequency domain, first voice data corresponding to a first window set to the predetermined time unit; generating a second spectrum by converting, into the frequency domain, second voice data corresponding to a second window that is set to the predetermined time unit and differs from the first window; and generating the spectrogram by merging the first spectrum and the second spectrum.

In addition, the first window and the second window may partially overlap in the time domain of the voice data.
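As an illustration only, the windowed conversion described above could be sketched as follows. The function names `spectrum` and `spectrogram`, the naive DFT, and the parameters `window_size` and `hop` are assumptions made for this sketch and do not appear in the specification; a practical implementation would more likely use an FFT and a tapering window.

```python
import cmath
import math


def spectrum(samples):
    """Magnitude spectrum of one window via a naive DFT (non-negative frequencies only)."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, window_size, hop):
    """Merge per-window spectra into a spectrogram (one column per window).

    Choosing hop < window_size makes consecutive windows partially overlap
    in the time domain.
    """
    columns = []
    for start in range(0, len(signal) - window_size + 1, hop):
        columns.append(spectrum(signal[start:start + window_size]))
    return columns


# Usage: a pure tone with 4 cycles per 16-sample window, 50% window overlap.
signal = [math.sin(2 * math.pi * 4 * t / 16) for t in range(64)]
spec = spectrogram(signal, window_size=16, hop=8)
```

Each column of `spec` corresponds to one window, so the column index maps back to time as `index * hop / sample_rate`, where `sample_rate` is whatever rate the voice data was recorded at.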

In addition, deriving the first section may include: sampling the voice data into sections of a predetermined time unit; generating a voice pattern that includes the predetermined message; extracting, using a deep learning module, a voice similarity for each section on the basis of the sampled per-section voice data and the voice pattern; and selecting, as the first section, a section whose voice similarity is higher than a predetermined reference value.

In addition, the deep learning module includes an input layer whose input nodes are the sampled per-section voice data and the voice pattern, an output layer whose output node is the voice similarity, and one or more hidden layers arranged between the input layer and the output layer. The weights of the nodes and edges between the input nodes and the output node may be updated through a training process of the deep learning module.
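The layer structure described above can be illustrated with a minimal forward pass. This is a sketch under assumptions, not the disclosed module: the feature vectors, the single hidden layer, the ReLU and sigmoid activations, and the hand-written placeholder weights in `params` are all hypothetical, and training (the weight update) is not shown.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: weights[j][i] is the edge from input i to node j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def voice_similarity(segment_features, pattern_features, params):
    """Input layer -> one hidden layer -> single output node in (0, 1)."""
    inputs = list(segment_features) + list(pattern_features)
    hidden = relu(dense(inputs, params["w_hidden"], params["b_hidden"]))
    (logit,) = dense(hidden, params["w_out"], params["b_out"])
    return sigmoid(logit)


# Placeholder, untrained parameters: 2 + 2 input nodes, 3 hidden nodes, 1 output node.
params = {
    "w_hidden": [[0.5, -0.5, 0.5, -0.5], [0.1, 0.2, 0.3, 0.4], [-0.3, 0.3, -0.3, 0.3]],
    "b_hidden": [0.0, 0.0, 0.0],
    "w_out": [[1.0, -1.0, 0.5]],
    "b_out": [0.0],
}
score = voice_similarity([0.2, 0.8], [0.3, 0.7], params)
```

With untrained weights the score carries no meaning; the sketch only shows how a per-section input and the voice pattern flow through the layers to one similarity value.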

In addition, the second section may be located after the first section in time series within the voice data.

In addition, a part of the second section may overlap the first section.

In addition, deriving the image frame may include deriving one or more frames from the second section using a predetermined period, or deriving, from the second section, frames whose optical flow is smaller than a reference value.

In addition, detecting the facial image may include deriving facial landmarks for each derived frame, performing correction for face alignment on the basis of the derived landmarks, and extracting feature points from the corrected image.
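One common way to perform a landmark-based alignment correction is to rotate the face so that the line between the two eyes becomes horizontal. The sketch below assumes that eye-center landmarks are already available from some landmark detector, which is not shown. The function names and the choice of the eye line as the alignment reference are assumptions for illustration, not details taken from the specification.

```python
import math


def alignment_angle(left_eye, right_eye):
    """Tilt of the eye line from horizontal, in degrees."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_points(points, center, angle_deg):
    """Rotate points about center by -angle_deg, which levels the eye line."""
    theta = math.radians(-angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    return [
        ((x - cx) * cos_t - (y - cy) * sin_t + cx,
         (x - cx) * sin_t + (y - cy) * cos_t + cy)
        for x, y in points
    ]


# Usage: eyes tilted by 45 degrees are rotated about their midpoint.
left_eye, right_eye = (0.0, 0.0), (10.0, 10.0)
angle = alignment_angle(left_eye, right_eye)
aligned = rotate_points([left_eye, right_eye], (5.0, 5.0), angle)
```

The same rotation would then be applied to the image pixels before feature points are extracted; the image warp itself is outside the scope of this sketch.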

A face detection method according to another embodiment of the present invention is performed in a server associated with a user terminal and includes: receiving image data and voice data from the user terminal; deriving a section related to a predetermined message on the basis of the received voice data; extracting a part of the image data within a predetermined range relative to the derived section; deriving, from the extracted image data, an image frame that satisfies a predetermined criterion; and detecting a facial image included in the derived image frame.

In addition, deriving the section may include: generating a spectrogram by converting the voice data into the frequency domain for each predetermined time unit; generating a frequency pattern of voice data that includes the predetermined message; and selecting, as the section, the section of the spectrogram having the highest similarity to the frequency pattern.

In addition, the first user may be the person in charge of the original mail, and the second user may be a manager of the person in charge.

In addition, deriving the section may include: sampling the voice data into sections of a predetermined time unit; generating a voice pattern that includes the predetermined message; extracting, using a deep learning module, a voice similarity for each section on the basis of the sampled per-section voice data and the voice pattern; and selecting, as the section, a section whose voice similarity is higher than a predetermined reference value.

The face detection method of the present invention derives a section related to a predetermined message using voice data converted into the frequency domain, and detects a facial image within the frames included in the image data corresponding to the derived section, so that an optimal, frontally aligned facial image can be found quickly. Accordingly, the present invention can shorten the time required for face detection, thereby improving the speed of detecting the user's face and reducing the load applied to the system.

In addition, the face detection method of the present invention derives the section of voice data most relevant to a predetermined message using a pre-trained deep learning module, and detects an optimal, frontally aligned facial image within the frames included in the image data corresponding to the derived section, so that such an image can be found quickly. Through this, the present invention can increase the accuracy of face detection and reduce the time and resources required for face detection.

In addition to the above, specific effects of the present invention are described below together with the specific details for carrying out the invention.

FIG. 1 is a conceptual diagram for explaining a system that performs a face detection method according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a process of calculating facial similarity based on the face detection method according to some embodiments of the present invention.
FIG. 3 is a flowchart for explaining a face detection method according to some embodiments of the present invention.
FIG. 4 is a flowchart for explaining an example of a method of deriving the first section in step S220 of FIG. 3.
FIG. 5 is a diagram for explaining some examples of generating a spectrogram in step S321 of FIG. 4.
FIG. 6 is a diagram for explaining a spectrogram generated through the face detection method of FIG. 4.
FIG. 7 is a diagram for explaining another example of a method of deriving the first section in step S220 of FIG. 3.
FIG. 8 is a block diagram schematically illustrating a deep learning module used in the face detection method of FIG. 7.
FIG. 9 is a diagram illustrating the configuration of the deep learning module of FIG. 8.
FIG. 10 is a flowchart for explaining some examples of steps S250 and S260 of FIG. 3.
FIG. 11 is a diagram for explaining a hardware implementation of a system that performs a face detection method according to some embodiments of the present invention.

Terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. On the principle that an inventor may define the concepts of terms and words in order to describe his or her own invention in the best way, they should be interpreted with meanings and concepts consistent with the technical idea of the present invention. In addition, the embodiments described in this specification and the configurations shown in the drawings are merely one embodiment in which the present invention is realized and do not represent the whole technical idea of the present invention, so it should be understood that various equivalents, modifications, and applicable examples that could replace them may exist at the time of filing.

Terms such as first, second, A, and B used in this specification and the claims may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

The terms used in this specification and the claims are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "include" or "have" should be understood as not excluding in advance the presence or possible addition of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In addition, each configuration, procedure, process, or method included in each embodiment of the present invention may be shared among embodiments to the extent that they do not technically contradict one another.

Hereinafter, a face detection method according to an embodiment of the present invention and a system for performing it are described in detail with reference to FIGS. 1 to 11.

FIG. 1 is a conceptual diagram for explaining a system that performs a face detection method according to an embodiment of the present invention.

Referring to FIG. 1, a system according to an embodiment of the present invention includes a financial company server 100, a user terminal 200, and an agent terminal 300.

The financial company server 100 (hereinafter, the server) mediates a video call between the user terminal 200 and the agent terminal 300, and may perform identity verification or identity authentication of the user using the video call data. In doing so, the server 100 may extract the user's facial image from the video call using the face detection method, and perform identity verification or identity authentication of the user using the extracted facial image.

However, the face detection method performed in the server 100 is not limited to the above operation and can obviously be applied in various embodiments. For convenience of explanation, the following description uses identity authentication of a user in a video call as an example.

The server 100 may operate as the entity that performs the face detection method. Specifically, the server 100 may receive video call data from the user terminal 200. The video call data may include voice data in which the user's voice is recorded and image data in which the user's face is captured.

Subsequently, the server 100 may derive a specific section (hereinafter, the first section) related to a predetermined message on the basis of the received voice data.

In doing so, the server 100 may derive a section of the voice data that is similar to a voice pattern including the predetermined message, using either a spectrogram generated by converting the user's voice data into the frequency domain or a deep learning module.

Here, a spectrogram is a tool for visualizing sound or waves, and refers to a graph that combines the characteristics of a waveform and a spectrum. A waveform graph shows the change along the amplitude axis as time changes, and a spectrum shows the change along the amplitude axis as frequency changes, whereas a spectrogram shows the difference in amplitude across the time and frequency axes as a difference in print density or display color.

In an embodiment of the present invention, the server 100 may derive the first section using a spectrogram of the voice data.

Specifically, the server 100 generates a spectrogram by converting the voice data into the frequency domain for each predetermined time unit. The server 100 then generates a frequency pattern of voice data that includes a predetermined message (for example, “Please turn your face toward the front of the camera”).

Subsequently, the server 100 may set, as the first section, the section of the spectrogram that is most similar to the generated frequency pattern. The first section may be set with respect to the time axis. The process of deriving the voice data section using a spectrogram is described in detail with reference to FIGS. 4 to 6.
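The specification does not state which similarity measure is used to compare the spectrogram with the frequency pattern. As one possible illustration, the sketch below slides the pattern along the time axis and scores each position by cosine similarity; the function names, the list-of-columns data layout, and the choice of cosine similarity are all assumptions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_first_section(spectrogram, pattern):
    """Return (start_column, end_column, score) of the span most similar to pattern.

    Both arguments are lists of columns, one column of magnitudes per time step.
    """
    width = len(pattern)
    flat_pattern = [v for col in pattern for v in col]
    best = (0, width, -1.0)
    for start in range(len(spectrogram) - width + 1):
        window = [v for col in spectrogram[start:start + width] for v in col]
        score = cosine_similarity(window, flat_pattern)
        if score > best[2]:
            best = (start, start + width, score)
    return best


# Usage: the pattern matches columns 2..3 of a toy 6-column spectrogram.
toy_spectrogram = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]]
toy_pattern = [[0, 1, 0], [0, 0, 1]]
start_col, end_col, best_score = find_first_section(toy_spectrogram, toy_pattern)
```

Because the result is expressed in spectrogram columns, it sits on the time axis and can be converted to seconds once the time step per column is known.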

In another embodiment of the present invention, the server 100 may derive the first section using a pre-trained deep learning module.

Specifically, the server 100 samples the voice data into sections of a predetermined time unit. The server 100 may then generate a voice pattern that includes a predetermined message (for example, “Please turn your face toward the front of the camera”). Next, the server 100 may use the pre-trained deep learning module to compare the sampled voice data with the generated voice pattern and calculate a voice similarity for each section. The algorithm for calculating the voice similarity may be modified in various ways, and because such algorithms are well known to those skilled in the art, a detailed description is omitted here.

Subsequently, the server 100 may select, as the first section, a section whose similarity is higher than a predetermined reference value. The process of deriving the voice data section using the deep learning module is described later with reference to FIGS. 7 to 9.
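The threshold-based selection can be sketched in a few lines. The sketch assumes that the per-section similarity scores have already been computed and that all sampled sections have the same length; `select_first_sections`, `segment_seconds`, and the example values are hypothetical.

```python
def select_first_sections(scores, segment_seconds, threshold):
    """Return (start_s, end_s) for every sampled section whose score exceeds threshold."""
    return [
        (i * segment_seconds, (i + 1) * segment_seconds)
        for i, score in enumerate(scores)
        if score > threshold
    ]


# Usage: three 2-second sections, of which only the second passes the threshold.
candidates = select_first_sections([0.1, 0.9, 0.4], segment_seconds=2.0, threshold=0.8)
```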

Subsequently, the server 100 may derive a second section on the basis of the derived first section. The second section may be placed at a different position from the first section, and its position relative to the first section may be set differently depending on the type of predetermined message.

For example, when the second section is derived on the basis of the predetermined message “Please turn your face toward the front of the camera”, the second section may be located after the first section in time series within the voice data.

As another example, when the second section is derived on the basis of the predetermined message “The face check has been completed.”, the second section may be located before the first section in time series within the voice data.
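The placement of the second section relative to the first, including the optional partial overlap, could be expressed as below. The `direction` flag, the `duration` and `overlap` parameters, and the clamping to the recording length are assumptions introduced for this sketch; the specification only states that the relative position depends on the message type.

```python
def second_section(first, duration, direction, overlap=0.0, total=None):
    """Place a section of the given duration before or after the first section.

    first     -- (start_s, end_s) of the first section
    direction -- "after" (e.g. an instruction message) or "before" (e.g. a completion message)
    overlap   -- seconds by which the second section reaches into the first
    total     -- optional recording length used to clamp the end time
    """
    start, end = first
    if direction == "after":
        s, e = end - overlap, end - overlap + duration
    elif direction == "before":
        s, e = start + overlap - duration, start + overlap
    else:
        raise ValueError("direction must be 'after' or 'before'")
    s = max(0.0, s)
    if total is not None:
        e = min(total, e)
    return (s, e)


# Usage: a 2-second second section around a first section spanning 3 s to 5 s.
after = second_section((3.0, 5.0), 2.0, "after")
before = second_section((3.0, 5.0), 2.0, "before")
overlapping = second_section((3.0, 5.0), 2.0, "after", overlap=0.5)
```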

Subsequently, the server 100 may extract a part of the image data on the basis of the section derived from the predetermined message (that is, the second section), and derive image frames included in the extracted image data.
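Because the section is defined on the time axis of the voice data, extracting the matching part of the image data amounts to converting seconds into frame indices. A minimal sketch, assuming a constant frame rate `fps` and synchronized audio and video (neither is stated in the specification):

```python
import math


def frame_range(section, fps, total_frames):
    """Indices of the video frames whose timestamps fall inside section (in seconds)."""
    start_s, end_s = section
    first = max(0, math.ceil(start_s * fps))
    last = min(total_frames, math.floor(end_s * fps) + 1)
    return range(first, last)


# Usage: a section from 5 s to 7 s in a 30 fps recording of 1000 frames.
frames_in_section = frame_range((5.0, 7.0), fps=30, total_frames=1000)
```

Only these frames then need to be decoded, rather than the whole recording.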

The server 100 may derive image frames from the derived section in various ways.

For example, the server 100 may derive image frames at a fixed time interval (for example, a 1/n frame interval). As another example, the server 100 may derive frames of the derived section whose optical flow is smaller than a reference value. Here, optical flow refers to a vector representation of the apparent motion that appears between two temporally different pieces of image data captured and input by a camera. These are only some examples of deriving image frames, and image frames may of course be derived by various other methods. Subsequently, the server 100 may detect a facial image in the derived image frames. The method of deriving image frames and detecting a facial image is described in detail with reference to FIG. 10.
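The two frame selection options mentioned above can be sketched as follows. Note the substitution: instead of computing true optical flow, the sketch uses the mean absolute pixel difference between consecutive grayscale frames as a crude stand-in for motion magnitude. A real implementation would compute an actual flow field and threshold its magnitude. All names and the toy frames are hypothetical.

```python
def periodic_frames(frame_indices, period):
    """Keep every period-th frame index of the section."""
    return list(frame_indices)[::period]


def mean_abs_diff(prev, curr):
    """Motion proxy: mean absolute pixel difference between two grayscale frames."""
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += abs(p - c)
            count += 1
    return total / count if count else 0.0


def low_motion_frames(frames, threshold):
    """Indices i >= 1 whose motion score relative to frame i - 1 is below threshold."""
    return [
        i for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) < threshold
    ]


# Usage: frame 1 differs strongly from frame 0, frame 2 is nearly identical to frame 1.
toy_frames = [
    [[0, 0], [0, 0]],
    [[100, 100], [100, 100]],
    [[100, 100], [100, 101]],
]
sampled = periodic_frames(range(150, 160), 3)
still = low_motion_frames(toy_frames, threshold=5.0)
```

Selecting low-motion frames favors moments when the user is holding still, which are less likely to be blurred.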

Subsequently, the server 100 may perform an identity verification or identity authentication procedure for the user using the derived facial image.

In the present invention, the server 100 and the user terminal 200 may be implemented as a server-client system. Specifically, the server 100 may classify, store, and manage, for each user account, voice data, image data, and previously received facial images (for example, an ID card image or a previously detected facial image), and may provide various services related to financial information, video calls, and the like through a terminal application installed on the user terminal 200.

The terminal application may be a dedicated application for receiving voice data and image data, or a web browsing application. The dedicated application may be an application built into the user terminal 200, or an application downloaded from an application distribution server and installed on the user terminal 200.

The user terminal 200 refers to a communication terminal capable of running an application in a wired or wireless communication environment. Although the user terminal 200 is illustrated in FIG. 1 as a smartphone, which is a type of portable terminal, the present invention is not limited thereto and may be applied without limitation to any device capable of running a financial application as described above. For example, the user terminal 200 may include various types of electronic devices such as a personal computer (PC), a notebook computer, a tablet, a mobile phone, a smartphone, and a wearable device (for example, a watch-type terminal).

In addition, although only one user terminal 200 is shown in the drawing, the present invention is not limited thereto, and the server 100 may operate in conjunction with a plurality of user terminals 200.

Additionally, the user terminal 200 may include an input unit that receives user input, a display unit that displays visual information, a communication unit that transmits and receives signals to and from the outside, a camera unit that captures the user's face, a microphone unit that converts the user's voice into digital data, and a control unit that processes data, controls each unit inside the user terminal 200, and controls data transmission and reception between the units. Hereinafter, operations that the control unit performs inside the user terminal 200 in response to a user command are collectively described as being performed by the user terminal 200.

Meanwhile, the agent terminal 300 operates in conjunction with the server 100 and may be the counterpart of a video call with the user terminal 200. Although not explicitly shown in the drawing, the server 100 operates in conjunction with a plurality of agent terminals 300 and, when a video call request is received from the user terminal 200, may select any one of the plurality of agent terminals 300 and match it with the user terminal 200 that requested the video call.

The server 100 relays between the matched user terminal 200 and agent terminal 300 so that they can conduct a video call with each other. In this case, the server 100 may store and manage the history of the video call between the user terminal 200 and the agent terminal 300.

Meanwhile, the communication network 400 connects the server 100, the user terminal 200, and the agent terminal 300. That is, the communication network 400 refers to a network that provides an access path through which the user terminal 200 or the agent terminal 300 can connect to the server 100 and then transmit and receive data. The communication network 400 may encompass, for example, wired networks such as LANs (Local Area Networks), WANs (Wide Area Networks), MANs (Metropolitan Area Networks), and ISDNs (Integrated Service Digital Networks), and wireless networks such as wireless LANs, CDMA, Bluetooth, and satellite communication, but the scope of the present invention is not limited thereto.

Hereinafter, a face detection method performed in a system according to an embodiment of the present invention will be described in detail.

FIG. 2 is a diagram for explaining a process of calculating facial similarity based on a face detection method according to some embodiments of the present invention.

Referring to FIG. 2, the server 100 analyzes the user's voice using the voice data SD in the video call data VC received from the user terminal 200, and extracts a specific section corresponding to a part of the video data VD (S110).

Specifically, the server 100 may receive, in real time, the video call data VC including the video data VD and the voice data SD from the user terminal 200 on which the video call is in progress. The server 100 may analyze the received voice data SD to derive a section related to a predetermined message (for example, "Please face the camera directly" or "Face capture is complete.").

In this case, the server 100 may derive the section related to the predetermined message using a spectrogram or a deep learning module. These approaches are described in detail with reference to FIGS. 4 to 6 and FIGS. 7 to 9, respectively.

Next, the server 100 extracts specific frames, through sampling, from the video data VD corresponding to the specific section derived from the voice data SD (S120).

Here, the server 100 may extract a partial section of the video data VD within a predetermined range relative to the derived specific section. The server 100 may then derive, from the extracted video data VD, one or more video frames that satisfy a predetermined criterion.

For example, the server 100 may sample frames from the extracted video data VD at regular time intervals, or may select video frames whose optical flow is smaller than a reference value.

As another example, the server 100 may run a pose detection algorithm on the extracted video data VD. When a predetermined pose is detected by the pose detection algorithm, the server 100 may terminate the pose detection algorithm and extract the video frame associated with the detected pose.
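
The early-terminating pose search described above might be sketched as follows. This is only an illustration: `detect_pose` stands in for whatever pose detection algorithm is used, and its name, signature, and string-valued pose labels are assumptions that do not appear in the source.

```python
def first_frame_with_pose(frames, detect_pose, target_pose):
    """Return the index of the first frame showing `target_pose`, or None.

    `detect_pose` is a hypothetical callable (frame -> pose label); the
    source does not name a concrete pose detection algorithm.
    """
    for index, frame in enumerate(frames):
        if detect_pose(frame) == target_pose:
            # Stop scanning as soon as the predetermined pose is found,
            # so later frames are never passed to the detector.
            return index
    return None
```

Returning on the first match mirrors the stated behavior of terminating the algorithm once the predetermined pose is detected.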

However, these are merely some examples of deriving video frames, and the present invention is not limited thereto.

Next, the server 100 detects the user's face in the extracted video frame (S130). The server 100 may detect the user's face using a pre-trained deep learning model (for example, MTCNN, Retinaface, or Blazeface). The user's face may be detected within the video frame by means of a bounding box. The deep learning model used in the server 100 may be modified in various ways.

Next, the server 100 aligns the extracted face of the user (S140).

Specifically, the server 100 may detect facial landmarks in the extracted face. Here, facial landmarks are the parts that make up the features of a face, such as the eyes, nose, mouth, jawline, and bridge of the nose. The server 100 may then align the face based on the detected facial landmarks. For example, the server 100 may draw a straight line between the two eyes, measure the angle between that line and the horizontal, and rotate the facial image by the opposite angle. However, this is only one example, and the present invention is not limited thereto.
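
As a minimal sketch of the eye-line alignment example above, the following computes the angle of the line between the two eyes and rotates landmark coordinates by the opposite angle. Rotating about the midpoint between the eyes, and rotating landmark points rather than pixel data, are assumptions made for brevity; the function names are hypothetical.

```python
import math


def eye_line_angle_deg(left_eye, right_eye):
    """Angle in degrees between the eye-to-eye line and the horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, center, angle_deg):
    """Rotate `point` about `center` by `angle_deg` (standard 2D rotation)."""
    theta = math.radians(angle_deg)
    x = point[0] - center[0]
    y = point[1] - center[1]
    return (
        center[0] + x * math.cos(theta) - y * math.sin(theta),
        center[1] + x * math.sin(theta) + y * math.cos(theta),
    )


def align_landmarks(landmarks, left_eye, right_eye):
    """Rotate all landmarks by the opposite of the eye-line angle."""
    angle = eye_line_angle_deg(left_eye, right_eye)
    # Assumption: the rotation center is the midpoint between the eyes.
    center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    return [rotate_point(p, center, -angle) for p in landmarks]
```

After alignment the two eye landmarks share the same vertical coordinate, which is the condition the example in the text aims for.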

Next, the server 100 extracts feature points from the aligned face (S150).

Next, the server 100 calculates facial similarity using the extracted facial feature points (S160). In this case, the server 100 may represent the extracted facial feature points as a real-valued vector, and may calculate the facial similarity by comparing that vector with feature points extracted from a previously stored image of the user's identification card. The facial similarity calculated in this way may be used to determine whether the face is that of the same user.
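
The source does not specify how the two real-valued feature vectors are compared. The sketch below assumes cosine similarity, one common choice, and a hypothetical decision threshold of 0.6; both are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length real-valued vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def is_same_face(live_features, id_features, threshold=0.6):
    """Hypothetical identity decision; the threshold value is an assumption."""
    return cosine_similarity(live_features, id_features) >= threshold
```

Here `live_features` would come from the video frame and `id_features` from the stored identification image.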

Hereinafter, the process of deriving the first section and the second section in the face detection method according to some embodiments of the present invention will be described in detail.

FIG. 3 is a flowchart illustrating a face detection method according to some embodiments of the present invention.

Referring to FIG. 3, the server 100 receives video data and voice data through a video call (S210).

Next, the server 100 derives a first section related to a predetermined message based on the received voice data (S220).

For example, the server 100 may set, as the first section, the section of the received voice data in which the predetermined message "Please face the camera directly" is output. In this case, the server 100 may derive the first section related to the predetermined message using a spectrogram obtained by converting the voice data into the frequency domain, or using a pre-trained deep learning module.

Next, the server 100 sets a second section based on the derived first section (S230).

For example, the server 100 may set, as the second section, a section lasting about 10 seconds from the end point of the derived first section, or a section from the end point of the first section up to the part containing the message "Face capture is complete." However, this is only one example, and the present invention is not limited thereto.

Here, the second section may be located later in the time series than the first section within the voice data, and a part of the second section may of course overlap the first section.
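
The two ways of setting the second section given above can be sketched as a small helper. The function name, the parameter names, and the choice to clip the result to the recording length are assumptions; which boundary of the completion message ends the section is also not fixed by the source, so the caller supplies that time directly.

```python
def second_section(first_end_s, duration_s=10.0, completion_time_s=None,
                   total_length_s=None):
    """Return (start, end) of the second section in seconds.

    The section starts at the end of the first section. It ends either at
    `completion_time_s` (the part containing the completion message, if one
    was found) or `duration_s` seconds later (about 10 seconds in the text).
    """
    if completion_time_s is not None:
        end = completion_time_s
    else:
        end = first_end_s + duration_s
    if total_length_s is not None:
        end = min(end, total_length_s)  # assumption: never run past the data
    if end < first_end_s:
        raise ValueError("second section would end before it starts")
    return (first_end_s, end)
```

For instance, `second_section(5.0)` yields the fixed 10-second variant, while passing `completion_time_s` yields the message-bounded variant.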

Next, the server 100 extracts the part of the video data corresponding to the second section (S240).

Next, the server 100 derives, from the extracted video data, a video frame that satisfies a predetermined criterion (S250). In this case, the server 100 may derive a video frame from the video data at every preset period (for example, 1/n), or may derive a video frame using optical flow.
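
Both frame selection criteria can be illustrated with index arithmetic alone. In this sketch the frame rate, the period, and the per-frame motion magnitudes are all assumed inputs: the motion values would come from some optical flow estimator that the source does not name, and the function names are hypothetical.

```python
import math


def periodic_frame_indices(start_s, end_s, fps, period_s):
    """Indices of frames sampled every `period_s` seconds within a section."""
    first = math.ceil(start_s * fps)
    last = math.floor(end_s * fps)
    step = max(1, round(period_s * fps))  # never step by less than one frame
    return list(range(first, last + 1, step))


def low_motion_indices(motion_magnitudes, reference_value):
    """Indices of frames whose motion magnitude is below the reference value.

    `motion_magnitudes[i]` is assumed to be a precomputed summary of the
    optical flow for frame i (for example, its mean magnitude).
    """
    return [i for i, m in enumerate(motion_magnitudes) if m < reference_value]
```

Frames with little motion tend to be sharper, which is one plausible reason to prefer them for face detection.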

Next, the server 100 detects a facial image included in the derived video frame (S260). In this case, the server 100 may detect the user's face using a pre-trained deep learning model (for example, MTCNN, Retinaface, or Blazeface), and the user's face may be detected within the video frame by means of a bounding box. However, the present invention is not limited thereto, and the deep learning model used in the server 100 may of course be modified in various ways.

Hereinafter, a face detection method that derives the first section using a spectrogram, according to an embodiment of the present invention, will be described.

FIG. 4 is a flowchart illustrating an example of a method of deriving the first section in step S220 of FIG. 3.

Referring to FIG. 4, following step S210, the server 100 converts the voice data into the frequency domain for each specific time unit to generate a spectrogram (S321).

Specifically, the server 100 may divide the voice data received from the user terminal 200 according to a predetermined time unit. The server 100 may then convert each of the divided pieces of voice data into the frequency domain to generate a plurality of spectra, and merge the generated spectra in chronological order to generate a spectrogram.

Next, the server 100 generates a frequency pattern of voice data containing a predetermined message (for example, "Please face the camera directly") (S323). In this case, the server 100 may convert a sample of voice data containing the predetermined message to generate a frequency pattern corresponding to that message.

Next, the server 100 compares the spectrogram generated in step S321 with the frequency pattern generated in step S323, and derives the first section in the time domain that is most similar to the frequency pattern (S325).

In this case, the server 100 may derive the similarity to the frequency pattern for each predetermined time unit of the spectrogram. The server 100 may then select, as the first section, the section of the spectrogram that has the highest similarity to the frequency pattern.

FIG. 5 is a diagram for explaining some examples of generating a spectrogram in step S321 of FIG. 4.

Referring to FIG. 5, (a11) shows voice data divided into windows of a predetermined time unit, and (a12) shows the spectrogram produced by converting the voice data divided in (a11) into the frequency domain and concatenating the resulting spectra in time-series order.

In this case, the server 100 may convert the voice data into the frequency domain using the STFT (Short-Time Fourier Transform). Here, the STFT is a method in which data is divided into short sections along the time axis and a Fourier transform is applied to each of the resulting sections, so that the frequency distribution per unit time can be rendered as an image.

Specifically, the server 100 may divide the voice data received from the user terminal 200 into predetermined time units. In the following, for convenience of explanation, the predetermined time unit is assumed to be 3.3 seconds.

For example, referring to (a11), the server 100 may divide 10-second-long voice data into units of 3.3 seconds. In this case, the server 100 may set the section from 0 to 3.3 seconds of the voice data as the first window W11, and the section from 3.4 to 6.6 seconds as the second window W12. The server 100 may also set the section from 6.8 to 10 seconds as the third window W13. Here, the window length is the predetermined time unit. That is, the length of each of the first to third windows W11 to W13 may be 3.3 seconds.

Next, the server 100 may convert the first to third windows W11 to W13 into the frequency domain to generate a spectrum for each. Specifically, the server 100 may convert the first voice data corresponding to the first window W11 into the frequency domain to generate the first spectrum S11. The server 100 may then convert the second voice data of the second window W12 to generate the second spectrum S12, and convert the third voice data of the third window W13 to generate the third spectrum S13.

Next, the server 100 may merge the generated first to third spectra S11 to S13 in time-series order to generate the spectrogram (a12) of the voice data.
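
The divide, transform, and merge steps above can be sketched in plain Python. This is a toy illustration under stated assumptions: a naive discrete Fourier transform is used instead of an optimized FFT, windows are measured in samples rather than seconds, no tapering function is applied, and the function names are hypothetical.

```python
import cmath


def magnitude_spectrum(samples):
    """Magnitudes of the non-negative-frequency DFT bins of one window."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


def spectrogram(signal, window_size):
    """Split `signal` into consecutive windows and stack their spectra.

    Returns a list of columns in time order; each column is the magnitude
    spectrum of one non-overlapping window.
    """
    columns = []
    for start in range(0, len(signal) - window_size + 1, window_size):
        columns.append(magnitude_spectrum(signal[start:start + window_size]))
    return columns
```

Each returned column plays the role of one spectrum such as S11, and the list order corresponds to merging the spectra in time-series order.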

Meanwhile, the server 100 may perform STFT analysis on the voice data using overlapping windows. In this case, the plurality of windows may overlap in the time domain of the voice data, and the overlap length may be preset or specified as a ratio of the window length.

For example, referring to (a21), the server 100 may divide 10-second-long voice data into units of 3.3 seconds. The server 100 may set the section from 0 to 3.3 seconds of the voice data as the first window W21.

Next, the server 100 may set a second window W22 that overlaps the first window W21. In this case, the second window W22 may be located in the section from 2.2 to 5.5 seconds.

The server 100 may also set a third window W23 that overlaps the second window W22, and a fourth window W24 that overlaps the third window W23.

Next, the server 100 may convert the first to fourth windows W21 to W24 into the frequency domain to generate the respective spectra S21 to S24.

Next, the server 100 may merge the generated spectra S21 to S24 in time-series order to generate the spectrogram (a22) of the voice data.

In this case, each spectrum may be placed in the section that remains after subtracting the time span overlapping the neighboring window on one side. For example, the first window W21 spans 0 to 3.3 seconds, but the converted first spectrum S21 may be placed at the position corresponding to 0 to 2.2 seconds, which excludes the span overlapping the adjacent second window W22.

In addition, in the generated spectrogram (a22), each spectrum can be seen to partially overlap the frequency-domain content of the spectra located on either side of it.

By using windows that overlap in the time domain in this way, the present invention can derive the first section at a finer granularity, which improves accuracy in deriving the section that matches the predetermined message.
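
The overlapping layout can be reproduced with a short sketch. Times are held in integer milliseconds to avoid floating-point drift. The 2.2-second hop is inferred from the positions given for W21 and W22; applying the same hop to W23 and W24, and letting the last spectrum keep its full span, are assumptions, as are the function names.

```python
def window_bounds(total_ms, window_ms, hop_ms):
    """(start, end) of each window that fits entirely within the data."""
    if window_ms <= 0 or hop_ms <= 0:
        raise ValueError("window and hop lengths must be positive")
    bounds = []
    start = 0
    while start + window_ms <= total_ms:
        bounds.append((start, start + window_ms))
        start += hop_ms
    return bounds


def placement_bounds(windows):
    """Span in which each window's spectrum is placed in the spectrogram.

    Each spectrum occupies its window minus the part overlapping the next
    window; the last window has no successor and keeps its full span.
    """
    placed = []
    for i, (start, end) in enumerate(windows):
        if i + 1 < len(windows):
            placed.append((start, windows[i + 1][0]))
        else:
            placed.append((start, end))
    return placed
```

With 10 seconds of data, a 3.3-second window, and a 2.2-second hop, `window_bounds(10000, 3300, 2200)` yields four windows, and the first placement span is 0 to 2.2 seconds, as in the example for S21.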

FIG. 6 is a diagram for explaining a spectrogram generated by the face detection method of FIG. 4.

Referring to FIG. 6, the server 100 may generate a spectrogram from the voice data received from the user terminal 200 through the process of FIG. 5 described above.

From the generated spectrogram, the server 100 may derive the section with the highest similarity to the frequency pattern of voice data containing the predetermined message. For example, the server 100 may divide the spectrogram into predetermined sections and calculate the similarity between the spectrum of each section and the frequency pattern.

The server 100 may then select, as the first section, the section containing the spectrum with the highest calculated similarity.
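
One way to realize this comparison is to slide the frequency pattern across the spectrogram and keep the best-scoring position. The sketch assumes both are lists of equal-height spectrum columns and uses cosine similarity over the flattened columns; the similarity measure and the function names are assumptions, since the source does not specify them.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length sequences (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_matching_section(spectrogram, pattern):
    """Return (start_column, similarity) of the section most like `pattern`."""
    width = len(pattern)
    if width == 0 or len(spectrogram) < width:
        raise ValueError("pattern must be non-empty and fit in the spectrogram")
    flat_pattern = [v for column in pattern for v in column]
    best_start, best_score = 0, -1.0
    for start in range(len(spectrogram) - width + 1):
        section = spectrogram[start:start + width]
        flat_section = [v for column in section for v in column]
        score = _cosine(flat_section, flat_pattern)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score
```

The returned start column can be converted back to a time offset by multiplying by the duration that one column represents.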

Additionally, in deriving the first section, the server 100 may use a log-mel spectrogram or LibROSA. However, this is only one example, and various algorithms may of course be used to derive the first section.

Hereinafter, a face detection method that derives the first section using a deep learning module, according to another embodiment of the present invention, will be described.

FIG. 7 is a diagram for explaining another example of a method of deriving the first section in step S220 of FIG. 3.

Referring to FIG. 7, the server 100 samples the voice data into sections of a specific time unit (S421). Specifically, the server 100 may input the voice data received from the user terminal 200 into a sampling module. The sampling module may divide the input voice data into sections of a preset time unit and output them section by section.

Next, the server 100 generates a voice pattern containing a predetermined message (S423). The server 100 may set, as the voice pattern, a part of voice data that contains the predetermined message (for example, "Please face the camera directly").

Next, the server 100 uses the deep learning module to extract the voice similarity for each section based on the sampled per-section voice data and the voice pattern (S425). In this case, the sampled per-section voice data and the voice pattern may be input to the input nodes of the deep learning module, and the voice similarity may be output at the output node.

Next, the server 100 derives a section in which the voice similarity output by the deep learning module is higher than a predetermined reference value, and sets it as the first section (S427). In this case, among the sections whose voice similarity is higher than the predetermined reference value, the server 100 may derive the section with the highest voice similarity as the first section.
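
Step S427 amounts to a threshold filter followed by a maximum. A minimal sketch, assuming the per-section similarities are already available as a list of numbers (the function name is hypothetical):

```python
def select_first_section(similarities, reference_value):
    """Index of the most similar section above the reference value, or None."""
    candidates = [(score, index)
                  for index, score in enumerate(similarities)
                  if score > reference_value]
    if not candidates:
        return None  # no section resembles the predetermined message enough
    best_score, best_index = max(candidates, key=lambda c: c[0])
    return best_index
```

Returning `None` when nothing clears the reference value lets the caller distinguish a missing message from a weak match.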

FIG. 8 is a block diagram schematically illustrating the deep learning module used in the face detection method of FIG. 7.

Specifically, referring to FIG. 8, the deep learning module DM may receive per-section voice data and a voice pattern as input, and output the voice similarity for each section.

In this case, the per-section voice data may be generated by the sampling module SM. The sampling module SM may sample the voice data received from the user terminal 200 so that it is divided into preset sections. The per-section voice data output by the sampling module SM may be input to the deep learning module DM. The voice pattern refers to voice data containing the predetermined message (for example, "Please face the camera directly").

The deep learning module DM may use an artificial neural network trained on big data to derive the similarity of each section's voice data to the voice pattern (that is, the per-section voice similarity).

The deep learning module DM may train the artificial neural network using mapping data for separate parameters derived from the input data. The deep learning module DM may perform machine learning on the parameters input as learning factors. In this case, the data used for machine learning, the result data, and the like may be stored in the memory of the server 100.

In more detail, deep learning, a type of machine learning, learns from data by descending through multiple stages to a deep level.

Deep learning refers to a set of machine learning algorithms that extract the essential data from a plurality of data items as the stages progress.

The deep learning module DM may use any of various known deep learning structures. For example, the deep learning module DM may use a structure such as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a DBN (Deep Belief Network), or a GNN (Graph Neural Network).

Specifically, a CNN (Convolutional Neural Network) is a model that mimics human brain function, built on the assumption that when a person recognizes an object, the brain first extracts the object's basic features, then performs complex computation, and recognizes the object based on the result.

An RNN (Recurrent Neural Network) is widely used in natural language processing and similar fields. It is a structure well suited to processing time-series data that changes over time, and can form an artificial neural network by stacking a layer at each time step.

A DBN (Deep Belief Network) is a deep learning structure formed by stacking multiple layers of RBMs (Restricted Boltzmann Machines), a deep learning technique. When RBM training is repeated until a certain number of layers is reached, a DBN having that number of layers may be formed.

A GNN (Graph Neural Network) is an artificial neural network structure that uses modeling data, modeled from data mapped between specific parameters, to derive the similarities and feature points among the modeling data.

Meanwhile, the artificial neural network of the deep learning module DM may be trained by adjusting the weights of the connections between nodes (and, if necessary, the bias values) so that a desired output is produced for a given input. The artificial neural network may also continuously update its weight values through training. Methods such as backpropagation may be used to train the artificial neural network.
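
The idea of adjusting weights and a bias so that the output approaches a desired value can be shown with the smallest possible case: one linear node trained by gradient descent on squared error. This is an illustrative simplification, not the training procedure of the deep learning module DM; the learning rate and the function name are assumptions.

```python
def train_step(weights, bias, inputs, target, learning_rate=0.1):
    """One gradient-descent update for a single linear node.

    Loss is 0.5 * (output - target)^2, so the gradient with respect to each
    weight is error * input, and with respect to the bias it is error.
    Returns the updated weights, updated bias, and the loss before the update.
    """
    output = sum(w * x for w, x in zip(weights, inputs)) + bias
    error = output - target
    new_weights = [w - learning_rate * error * x
                   for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias, 0.5 * error * error
```

Backpropagation applies the same kind of update to every layer by passing the error gradient backward through the network.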

Meanwhile, an artificial neural network pre-trained by machine learning may be loaded in the memory of the server 100.

The deep learning module DM may perform a machine-learning-based improvement process recommendation operation that takes modeling data for the derived parameters as input data. In this case, both semi-supervised learning and supervised learning may be used as the machine learning method of the artificial neural network. In addition, depending on its settings, the deep learning module DM may be controlled to automatically update, after training, the artificial neural network structure used to output the per-section voice similarity.

Additionally, although not explicitly shown in the drawings, in another embodiment of the present invention the operation of the deep learning module DM may be performed in the server 100 or in a separate cloud server (not shown). Hereinafter, the configuration of the deep learning module DM according to the embodiment described above will be examined.

FIG. 9 is a diagram illustrating the configuration of the deep learning module of FIG. 8.

Referring to FIG. 9, the deep learning module DM includes an input layer whose input nodes receive the per-section voice data and the voice pattern, an output layer whose output node provides the per-section voice similarity, and M hidden layers disposed between the input layer and the output layer.
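
The layered structure of FIG. 9 can be illustrated by a forward pass through fully connected layers. The source does not state the layer type or the activation functions; the dense layers, ReLU hidden activations, and sigmoid output used here are assumptions, as are the function names.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_layers, output_layer):
    """Propagate `inputs` through M hidden layers and one output layer.

    `hidden_layers` is a list of (weights, biases) pairs, one per hidden
    layer; `output_layer` is a single (weights, biases) pair.
    """
    activation = inputs
    for weights, biases in hidden_layers:
        # Assumption: ReLU activation in the hidden layers.
        activation = [max(0.0, v) for v in dense(activation, weights, biases)]
    weights, biases = output_layer
    # Assumption: sigmoid squashes each output into the range (0, 1),
    # which is convenient for a similarity score.
    return [1.0 / (1.0 + math.exp(-v))
            for v in dense(activation, weights, biases)]
```

The length of `hidden_layers` corresponds to M, and each weight row holds the edge weights feeding one node.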

Here, a weight may be set on each edge connecting the nodes of the layers. These weights, and the edges themselves, may be added, removed, or updated during the learning process. Accordingly, through the learning process, the weights of the nodes and edges disposed between the k input nodes and the i output nodes may be updated.

Before the deep learning module (DM) performs learning, all nodes and edges may be set to initial values. As information is cumulatively input, the weights of the nodes and edges change, and in this process a match may be established between the parameters input as learning factors (that is, the voice data for each section and the voice pattern) and the value assigned to the output node (that is, the voice similarity for each section).

Additionally, when a cloud server (not shown) is used, the deep learning module (DM) may receive and process a large number of parameters. The deep learning module (DM) may therefore perform learning based on a large volume of data.

The weights of the nodes and edges between the input nodes and the output nodes constituting the deep learning module (DM) may be updated by the learning process of the deep learning module (DM). In addition, the parameters output from the deep learning module (DM) may of course be extended to various data other than the voice similarity for each section.
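The layered structure and weight update described above can be illustrated with a minimal sketch. It is not the implementation of the deep learning module (DM): the layer sizes, the tanh and sigmoid activations, the squared-error loss, the learning rate, and the names `init_network`, `forward`, and `train_step` are all assumptions chosen for illustration.

```python
import math
import random


def init_network(sizes, seed=0):
    """Create one (weights, biases) pair per layer; sizes = [k, h1, ..., hM, i]."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(net, x):
    """Return the activations of every layer, starting with the input itself."""
    acts = [list(x)]
    for idx, (weights, biases) in enumerate(net):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + b
             for row, b in zip(weights, biases)]
        if idx == len(net) - 1:
            # Assumed output activation: sigmoid, so the similarity lies in (0, 1).
            acts.append([1.0 / (1.0 + math.exp(-v)) for v in z])
        else:
            # Assumed hidden activation: tanh.
            acts.append([math.tanh(v) for v in z])
    return acts


def train_step(net, x, target, lr=0.5):
    """One gradient-descent update of all edge weights (squared-error loss)."""
    acts = forward(net, x)
    delta = [(y - t) * y * (1.0 - y) for y, t in zip(acts[-1], target)]
    updated = []
    for idx in reversed(range(len(net))):
        weights, biases = net[idx]
        a_prev = acts[idx]
        # Error propagated to the previous layer, using the old weights
        # (the value computed for idx == 0 is never used).
        prev_delta = [
            sum(weights[j][i] * delta[j] for j in range(len(weights)))
            * (1.0 - a_prev[i] ** 2)
            for i in range(len(a_prev))
        ]
        new_weights = [[w - lr * delta[j] * a_prev[i] for i, w in enumerate(row)]
                       for j, row in enumerate(weights)]
        new_biases = [b - lr * d for b, d in zip(biases, delta)]
        updated.append((new_weights, new_biases))
        delta = prev_delta
    return updated[::-1]
```

In this sketch, k = 4 input nodes, M = 2 hidden layers, and i = 1 output node correspond to `init_network([4, 6, 6, 1])`; repeated calls to `train_step` move the output toward the target similarity.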

Subsequently, the server 100 may set a second section based on the first section and extract the part of the image data corresponding to the second section. Since this has been described in detail above, a redundant description is omitted.
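As a rough illustration of mapping a second section onto image frames, the following sketch places the second section later than the first section with a partial overlap and converts it to frame indices. The overlap, the length, the frame rate, and the function names `second_section` and `frame_range` are hypothetical values and names, not taken from this document.

```python
import math


def second_section(first, overlap=0.5, length=1.5):
    """Return (start, end) in seconds for a section that begins `overlap`
    seconds before the first section ends and lasts `length` seconds.
    Both parameters are illustrative assumptions."""
    start, end = first
    new_start = max(start, end - overlap)
    return (new_start, new_start + length)


def frame_range(section, fps):
    """Indices of the image frames that fall inside the section."""
    start, end = section
    return range(math.floor(start * fps), math.ceil(end * fps))
```

For a first section of (2.0, 3.5) seconds and 30 frames per second, this sketch yields a second section of (3.0, 4.5) seconds, covering frames 90 through 134.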

Hereinafter, some examples of a method of deriving, from the extracted image data, an image frame that satisfies a preset criterion and detecting a facial image included in the derived image frame are described.

FIG. 10 is a flowchart for explaining some examples of steps S250 and S260 of FIG. 3.

Referring to FIG. 10, in an embodiment of the present invention, the server 100 may derive image frames from the image data for the second section at a predetermined time interval (for example, a 1/n frame interval) (S551).

In this case, the server 100 may preset a frame derivation period for deriving image frames. For example, when the derivation period is set to 10, the server 100 may derive one image frame for every 10 image frames included in the image data of the second section. However, this is only an example, and the derivation period may of course be variable or random.
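The fixed derivation period can be sketched as follows. The function name `sample_frames` and the `offset` parameter are assumptions for illustration, and `frames` stands for any sequence of decoded image frames.

```python
def sample_frames(frames, period=10, offset=0):
    """Keep one frame out of every `period` frames, starting at `offset`."""
    if period < 1:
        raise ValueError("period must be a positive integer")
    return frames[offset::period]
```

With a period of 10, a 30-frame second section yields the frames at indices 0, 10, and 20.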

Meanwhile, in another embodiment of the present invention, the server 100 derives, from the image data for the second section, an image frame whose optical flow is smaller than a reference value (S553).

For example, the server 100 may extract a first frame and a second frame from the image data of the second section and extract an optical flow in vector form based on one or more feature points within each image frame. The server 100 may then calculate the magnitude of the optical flow by computing the absolute value of the vector. When the calculated magnitude of the optical flow is smaller than a preset reference value, the server 100 may derive the image frame containing that optical flow.
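A minimal sketch of this selection rule follows. It assumes the feature points have already been tracked from frame to frame by some other means, and it uses the mean magnitude over all feature points as the frame's optical flow magnitude; both the averaging and the names `flow_magnitudes` and `select_still_frames` are assumptions, not details given in this document.

```python
import math


def flow_magnitudes(points_a, points_b):
    """Magnitude of the displacement vector of each tracked feature point."""
    return [math.hypot(xb - xa, yb - ya)
            for (xa, ya), (xb, yb) in zip(points_a, points_b)]


def select_still_frames(tracked_points, threshold):
    """tracked_points[t] holds the (x, y) positions of the feature points in
    frame t. Return the indices of frames whose mean flow magnitude relative
    to the previous frame is below `threshold`."""
    selected = []
    for t in range(1, len(tracked_points)):
        mags = flow_magnitudes(tracked_points[t - 1], tracked_points[t])
        if mags and sum(mags) / len(mags) < threshold:
            selected.append(t)
    return selected
```

Frames with little motion are less likely to be blurred, which is presumably why a small optical flow is used as the criterion here.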

However, these are only some examples of deriving an image frame, and the present invention is not limited to the above methods.

Next, the server 100 detects the user's facial image in the derived image frame. The server 100 may detect the user's facial image using a pre-trained deep learning model (for example, MTCNN, RetinaFace, or BlazeFace). The user's facial image may be detected within the image frame using a bounding box. The deep learning model used in the server 100 may be modified in various ways.

Next, the server 100 derives facial landmarks for each derived image frame (S561). For example, the server 100 may derive the eyes, nose, mouth, jawline, or nose bridge from the face shown in the image frame.

Next, the server 100 performs correction for facial alignment based on the derived landmarks (S563). For example, the server 100 may generate a straight line by connecting the starting point of the left eye and the starting point of the right eye among the derived landmarks. The server 100 may then measure the angle between the generated straight line and a horizontal reference line. The server 100 may align the facial image by rotating the derived facial image by an angle of the same magnitude as the measured angle in the opposite direction. However, this is only an example, and the present invention is not limited to the above method.
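The alignment geometry can be sketched on landmark coordinates alone. The sketch below measures the angle of the eye line against the horizontal and rotates a set of points by the opposite angle about the midpoint between the eyes; the choice of rotation center and the names `eye_line_angle`, `rotate_points`, and `align_points` are assumptions. Rotating the image pixels themselves would use the same angle.

```python
import math


def eye_line_angle(left_eye, right_eye):
    """Angle (radians) between the line joining the eyes and the horizontal."""
    return math.atan2(right_eye[1] - left_eye[1], right_eye[0] - left_eye[0])


def rotate_points(points, angle, center):
    """Rotate (x, y) points by `angle` radians about `center`."""
    c, s = math.cos(angle), math.sin(angle)
    cx, cy = center
    return [(cx + (x - cx) * c - (y - cy) * s,
             cy + (x - cx) * s + (y - cy) * c)
            for x, y in points]


def align_points(points, left_eye, right_eye):
    """Rotate points by the opposite of the eye-line angle so that the eye
    line becomes horizontal. The center is the midpoint between the eyes."""
    angle = eye_line_angle(left_eye, right_eye)
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    return rotate_points(points, -angle, center)
```

For example, `align_points([left, right], left, right)` returns two points with equal y coordinates and the same distance between them as before.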

Next, the server 100 extracts feature points from the image that has been corrected for facial alignment (S565). Since the feature points may be extracted by various publicly known algorithms, a detailed description is omitted here.

Subsequently, the server 100 may calculate a facial similarity by comparing the feature points extracted from the user's ID image with the feature points extracted from the corrected image. The facial similarity calculated in this way may be used to determine whether the faces belong to the same user.
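This document does not specify how the feature points are compared. One common choice, shown here purely as an assumption, is the cosine similarity between two feature vectors of equal length, followed by a threshold; the names `cosine_similarity` and `is_same_person` and the threshold value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_person(id_features, live_features, threshold=0.8):
    """Hypothetical decision rule: same person if the similarity reaches
    the threshold."""
    return cosine_similarity(id_features, live_features) >= threshold
```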

FIG. 11 is a diagram for explaining a hardware implementation of a system that performs the face detection method according to some embodiments of the present invention.

Referring to FIG. 11, the server 100 that performs the face detection method according to some embodiments of the present invention may be implemented as an electronic device 1000. The electronic device 1000 may include a controller 1010, an input/output device (I/O) 1020, a memory device 1030, an interface 1040, and a bus 1050. The controller 1010, the input/output device 1020, the memory device 1030, and/or the interface 1040 may be coupled to each other through the bus 1050. The bus 1050 corresponds to a path through which data is transferred.

Specifically, the controller 1010 may include at least one of a central processing unit (CPU), a micro processor unit (MPU), a micro controller unit (MCU), a graphics processing unit (GPU), a microprocessor, a digital signal processor, a microcontroller, an application processor (AP), and logic devices capable of performing similar functions.

The input/output device 1020 may include at least one of a keypad, a keyboard, a touch screen, and a display device. The memory device 1030 may store data and/or programs.

The interface 1040 may transmit data to a communication network or receive data from the communication network. The interface 1040 may be wired or wireless. For example, the interface 1040 may include an antenna or a wired/wireless transceiver. Although not shown, the memory device 1030 may further include high-speed DRAM and/or SRAM as a working memory for improving the operation of the controller 1010. The memory device 1030 may store a program or an application.

The user terminal 200 may be applied to a personal digital assistant (PDA), a portable computer, a web tablet, a wireless phone, a mobile phone, a digital music player, a memory card, or any electronic product capable of transmitting and/or receiving information in a wireless environment.

Alternatively, the server 100 and the user terminal 200 according to embodiments of the present invention may each be a system in which a plurality of electronic devices 1000 are connected to each other through a network. In this case, each module or combination of modules may be implemented as an electronic device 1000. However, the present embodiment is not limited thereto.

Additionally, the server 100 may be implemented as at least one of a workstation, a data center, an internet data center (IDC), a direct attached storage (DAS) system, a storage area network (SAN) system, a network attached storage (NAS) system, and a RAID (redundant array of inexpensive disks, or redundant array of independent disks) system, but the present embodiment is not limited thereto.

In addition, the server 100 may transmit data through a network using the user terminal 200. The network may include networks based on wired Internet technology, wireless Internet technology, and short-range communication technology. Wired Internet technology may include, for example, at least one of a local area network (LAN) and a wide area network (WAN).

Wireless Internet technology may include, for example, at least one of Wireless LAN (WLAN), DLNA (Digital Living Network Alliance), WiBro (Wireless Broadband), WiMAX (Worldwide Interoperability for Microwave Access), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), IEEE 802.16, Long Term Evolution (LTE), LTE-A (Long Term Evolution-Advanced), Wireless Mobile Broadband Service (WMBS), and 5G NR (New Radio) technology. However, the present embodiment is not limited thereto.

Short-range communication technology may include, for example, at least one of Bluetooth, RFID (Radio Frequency Identification), infrared communication (Infrared Data Association: IrDA), UWB (Ultra-Wideband), ZigBee, Near Field Communication (NFC), Ultra Sound Communication (USC), Visible Light Communication (VLC), Wi-Fi, Wi-Fi Direct, and 5G NR (New Radio). However, the present embodiment is not limited thereto.

The server 100 communicating through the network may comply with technical standards and standard communication methods for mobile communication. For example, the standard communication methods may include at least one of GSM (Global System for Mobile communication), CDMA (Code Division Multiple Access), CDMA2000 (Code Division Multiple Access 2000), EV-DO (Evolution-Data Optimized or Evolution-Data Only), WCDMA (Wideband CDMA), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), LTE (Long Term Evolution), LTE-A (Long Term Evolution-Advanced), and 5G NR (New Radio). However, the present embodiment is not limited thereto.

In summary, the face detection method of the present invention may derive the section related to a predetermined message using voice data converted into the frequency domain, or may derive the section of the voice data most closely related to the predetermined message using a pre-trained deep learning module. The present invention then detects a facial image within the frames included in the image data corresponding to the derived section, and can thereby quickly find an optimal, front-facing facial image.

Accordingly, the present invention can shorten the time required for face detection and thus improve the speed at which the user's face is detected, increase the accuracy of face detection, and reduce the load applied to the system.

The above description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the present embodiment belongs will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the present embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the claims below, and all technical ideas within the equivalent scope should be interpreted as falling within the scope of rights of the present embodiment.

Claims

1. In a face detection method performed in a server associated with a user terminal,
receiving image data and voice data from the user terminal;
deriving a first section related to a predetermined message based on the received voice data;
setting a second section based on the derived first section;
extracting a part of the image data corresponding to the second section;
deriving an image frame satisfying a predetermined criterion from the extracted image data; and
detecting a facial image included in the derived image frame
Face detection method.

2. The method of claim 1,
The step of deriving the first section comprises:
generating a spectrogram in which the voice data is converted into a frequency domain for each predetermined time unit;
generating a frequency pattern of voice data including the predetermined message;
Selecting a section having the highest similarity to the frequency pattern in the spectrogram as the first section
Face detection method.

3. The method of claim 2,
The step of generating the spectrogram comprises:
generating a first spectrum obtained by converting first voice data corresponding to a first window set in the predetermined time unit into a frequency domain;
generating a second spectrum obtained by converting second voice data corresponding to a second window different from the first window, which is set in the predetermined time unit, into a frequency domain;
merging the first spectrum and the second spectrum to generate the spectrogram
Face detection method.

4. The method of claim 3,
The first window and the second window are partially overlapped in the time domain of the voice data.
Face detection method.

5. The method of claim 1,
The step of deriving the first section comprises:
sampling the voice data in a section of a predetermined time unit;
generating a voice pattern including a predetermined message;
extracting voice similarity for each section based on the sampled voice data for each section and the voice pattern using a deep learning module;
selecting a section in which the voice similarity is higher than a predetermined reference value as the first section
Face detection method.

6. The method of claim 5,
The deep learning module,
an input layer using the sampled voice data for each section and the voice pattern as input nodes;
an output layer using the speech similarity as an output node;
one or more hidden layers disposed between the input layer and the output layer;
The weights of nodes and edges between the input node and the output node are updated by the learning process of the deep learning module.
Face detection method.

7. The method of claim 1,
The second section is located later in the time series than the first section in the voice data.
Face detection method.

8. The method of claim 7,
A part of the second section overlaps the first section
Face detection method.

9. The method of claim 1,
The step of deriving the image frame comprises:
deriving one or more frames from the second section using a predetermined period, or
deriving, from the second section, a frame whose optical flow is smaller than a reference value
Face detection method.

10. The method of claim 1,
The step of detecting the facial image comprises:
deriving a facial landmark for each of the derived frames;
performing correction for facial alignment based on the derived landmark;
extracting feature points from the corrected image
Face detection method.

11. In a face detection method performed in a server associated with a user terminal,
receiving image data and voice data from the user terminal;
deriving a section related to a predetermined message based on the received voice data;
extracting a part of the image data in a predetermined range based on the derived section;
deriving an image frame satisfying a predetermined criterion from the extracted image data; and
detecting a facial image included in the derived image frame
Face detection method.

12. The method of claim 11,
The step of deriving the section comprises:
generating a spectrogram in which the voice data is converted into a frequency domain for each predetermined time unit;
generating a frequency pattern of voice data including the predetermined message;
Selecting a section having the highest similarity to the frequency pattern in the spectrogram as the section
Face detection method.

13. The method of claim 11,
The step of deriving the section comprises:
sampling the voice data in a section of a predetermined time unit;
generating a voice pattern including a predetermined message;
extracting voice similarity for each section based on the sampled voice data for each section and the voice pattern using a deep learning module;
selecting a section in which the voice similarity is higher than a predetermined reference value as the section
Face detection method.