KR20240070392A

KR20240070392A - Electronic device for processing image and method for operating the same

Info

Publication number: KR20240070392A
Application number: KR1020230115612A
Authority: KR
Inventors: 한승주; 후이 리
Original assignee: 삼성전자주식회사
Priority date: 2022-11-14
Filing date: 2023-08-31
Publication date: 2024-05-21
Also published as: CN115909447A

Abstract

An electronic device that processes images and a method of operating the same are disclosed. The method of operating the electronic device includes obtaining an initial image feature matrix of a face image based on a multi-level feature map of the face image; obtaining an initial facial a priori feature matrix of the face image based on the last-level feature map of the multi-level feature map; and obtaining a super-resolution image of the face image and/or key point coordinates of the face image, using one or more cascaded encoders, based on the initial image feature matrix and the initial facial a priori feature matrix.

Description

Electronic device for processing images and method of operating the same {ELECTRONIC DEVICE FOR PROCESSING IMAGE AND METHOD FOR OPERATING THE SAME}

The disclosure below relates to an electronic device that processes images and a method of operating the same.

With the recent development of deep neural network technology, face super-resolution (FSR) technology has advanced significantly. FSR is mainly performed based on a convolutional neural network (CNN), a generative adversarial network (GAN), ensemble learning, or reinforcement learning. Improving FSR performance may require a complex network structure design. However, a more complex network structure increases memory size, the amount of computation, and the number of parameters, which may increase the training time and computation cost of the network. In addition, although FSR performance may be improved by using face prior information, FSR methods that use face prior information may require additional labeling of that information.

A method of operating an electronic device according to an embodiment includes obtaining an initial image feature matrix of a face image based on a multi-level feature map of the face image; obtaining an initial facial a priori feature matrix of the face image based on the last-level feature map of the multi-level feature map; and obtaining a super-resolution image of the face image and/or key point coordinates of the face image, using one or more cascaded encoders, based on the initial image feature matrix and the initial facial a priori feature matrix.

The operation of obtaining the super-resolution image of the face image may include obtaining a fused image feature matrix based on the initial image feature matrix and the initial facial a priori feature matrix, using a cross attention module included in the one or more encoders; obtaining an updated image feature matrix of the face image based on the fused image feature matrix, using a first deformable attention module included in the one or more encoders; and obtaining the super-resolution image of the face image based on the updated image feature matrix and the face image.

The operation of obtaining the key point coordinates of the face image may include obtaining a fused image feature matrix based on the initial image feature matrix and the initial facial a priori feature matrix, using a cross attention module included in the one or more encoders; obtaining updated facial a priori features of the face image based on the fused image feature matrix and the initial facial a priori feature matrix, using a second deformable attention module included in the one or more encoders; and predicting the key point coordinates of the face image based on the updated facial a priori features and initial key point coordinates of the face image, where the initial key point coordinates of the face image are obtained based on the initial facial a priori feature matrix.

Each of the one or more encoders may include a first network, a second network, and a third network, where the first network includes a cross attention module, the second network includes a first deformable attention module, and the third network includes a second deformable attention module. The operation of obtaining the super-resolution image of the face image and/or the key point coordinates of the face image may include, for each encoder: obtaining a fused image feature matrix of the current encoder using the first network, based on the image feature matrix and the facial a priori feature matrix corresponding to the current encoder; obtaining an updated facial a priori feature matrix of the current encoder using the second network, based on the fused image feature matrix of the current encoder and the facial a priori feature matrix corresponding to the current encoder; and obtaining an updated image feature matrix of the current encoder using the third network, based on the fused image feature matrix of the current encoder. The operation may further include obtaining the super-resolution image of the face image based on the updated image feature matrix of the last encoder of the one or more encoders and the face image, and/or predicting the key point coordinates of the face image based on the updated facial a priori feature matrix of the last encoder and initial key point coordinates of the face image. When the current encoder is the first of the one or more encoders, the image feature matrix corresponding to the current encoder is the initial image feature matrix, and the facial a priori feature matrix corresponding to the current encoder is the initial facial a priori feature matrix. When the current encoder is not the first encoder, the image feature matrix corresponding to the current encoder is the updated image feature matrix of the preceding encoder, and the facial a priori feature matrix corresponding to the current encoder may be the updated facial a priori feature matrix of the preceding encoder.

The operation of obtaining the super-resolution image of the face image and/or the key point coordinates of the face image may include obtaining a first offset using an upsampling amplification network, based on the updated image feature matrix corresponding to the last cascaded encoder, and obtaining the super-resolution image based on the first offset and the face image; and/or obtaining a second offset using a key point prediction network, based on the updated facial a priori feature matrix corresponding to the last cascaded encoder, and obtaining the predicted key point coordinates of the face image based on the second offset and the initial key point coordinates of the face image.
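The two offset-based outputs described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how an offset is combined with the face image or with the initial key point coordinates, so the sketch assumes simple addition, with the face image first enlarged by nearest-neighbor repetition. The function names `super_resolve` and `refine_key_points` and the array shapes are hypothetical, and the offsets are passed in rather than produced by the networks.

```python
import numpy as np

def super_resolve(lr_image, first_offset, scale):
    """Assumed combination: enlarge the LR image, then add the first offset.

    lr_image: (H, W, 3); first_offset: (H*scale, W*scale, 3).
    """
    upsampled = lr_image.repeat(scale, axis=0).repeat(scale, axis=1)
    return upsampled + first_offset

def refine_key_points(initial_xy, second_offset):
    """Assumed combination: add the second offset to the initial coordinates."""
    return initial_xy + second_offset

lr_image = np.arange(12, dtype=float).reshape(2, 2, 3)
sr_image = super_resolve(lr_image, np.zeros((4, 4, 3)), scale=2)
key_points = refine_key_points(np.array([[0.5, 0.5]]), np.array([[0.1, -0.1]]))
```

With a zero offset the result is just the enlarged input, which makes the role of the first offset as a learned correction easy to see.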

The first network of each encoder may further include a layer normalization layer and a feedforward network layer. The operation of obtaining the fused image feature matrix of the current encoder may include inputting, to the cross attention module, the image feature matrix corresponding to the current encoder with position information embedded as the query vector, the facial a priori feature matrix corresponding to the current encoder with position information embedded as the key vector, and the facial a priori feature matrix corresponding to the current encoder as the value vector, and obtaining the fused image feature matrix of the current encoder through the cascaded cross attention module, layer normalization layer, and feedforward network layer.
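The query, key, and value assignment of the first network can be sketched as below. This is a minimal single-head sketch, not the disclosed implementation: it assumes scaled dot-product attention without learned query/key/value projections or residual connections, a layer normalization without learned scale and shift, and a two-layer ReLU feedforward network. All names (`fuse_image_features`, `w1`, `w2`) and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def fuse_image_features(image_feats, image_pos, prior_feats, prior_pos, w1, w2):
    """Cross attention -> layer normalization -> feedforward network."""
    query = image_feats + image_pos   # (M, C): image features with position info
    key = prior_feats + prior_pos     # (N, C): a priori features with position info
    value = prior_feats               # (N, C): a priori features without position info
    attn = softmax(query @ key.T / np.sqrt(query.shape[-1]))  # (M, N)
    normed = layer_norm(attn @ value)                         # (M, C)
    return np.maximum(normed @ w1, 0.0) @ w2                  # (M, C)

rng = np.random.default_rng(0)
M, N, C, hidden = 20, 5, 8, 16
fused = fuse_image_features(
    rng.standard_normal((M, C)), rng.standard_normal((M, C)),
    rng.standard_normal((N, C)), rng.standard_normal((N, C)),
    rng.standard_normal((C, hidden)), rng.standard_normal((hidden, C)),
)
```

The output keeps the shape of the image feature matrix (M rows, C columns), so it can be passed on within the same encoder.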

The operation of obtaining the updated image feature matrix of the current encoder using the third network, based on the fused image feature matrix of the current encoder, may include determining a normalized position of each feature in the fused image feature matrix of the current encoder, where the normalized position of a feature indicates the normalized position, within its corresponding feature map, of the feature-map element to which that feature corresponds; determining, in each feature map of the multi-level feature map, K normalized positions near the normalized position of each feature according to a preset rule; and performing a weighted summation over the L*K features in the fused image feature matrix of the current encoder that correspond to the K normalized positions in each feature map of the multi-level feature map, thereby obtaining, for each feature of the fused image feature matrix, the corresponding feature of the updated image feature matrix of the current encoder, where L is the number of feature maps in the multi-level feature map.
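The weighted summation over L*K sampled features can be sketched for a single query feature as follows. The sketch makes several assumptions that the text does not state: the per-level slices of the fused image feature matrix are reshaped back to (C, H, W) maps, sampling uses the nearest grid cell, and the K positions per level and the weights are passed in directly rather than derived from the preset rule. The names `sample_nearest` and `deformable_update` are hypothetical.

```python
import numpy as np

def sample_nearest(fmap, xy):
    """Read the feature at a normalized (x, y) position in [0, 1] from a (C, H, W) map."""
    _, h, w = fmap.shape
    col = int(np.clip(np.rint(xy[0] * (w - 1)), 0, w - 1))
    row = int(np.clip(np.rint(xy[1] * (h - 1)), 0, h - 1))
    return fmap[:, row, col]

def deformable_update(ref_xy, offsets, weights, feature_maps):
    """Weighted sum of L*K features sampled near one normalized reference position.

    ref_xy: (2,); offsets: (L, K, 2); weights: (L, K), assumed to sum to 1.
    """
    out = np.zeros(feature_maps[0].shape[0])
    for level, fmap in enumerate(feature_maps):
        for k in range(offsets.shape[1]):
            out += weights[level, k] * sample_nearest(fmap, ref_xy + offsets[level, k])
    return out

C, L, K = 3, 4, 4
sizes = [(8, 8), (4, 4), (2, 2), (1, 1)]
# Level l is filled with the constant l + 1 so the result is easy to check by hand.
maps = [np.full((C, h, w), float(level + 1)) for level, (h, w) in enumerate(sizes)]
updated = deformable_update(
    np.array([0.5, 0.5]), np.zeros((L, K, 2)), np.full((L, K), 1.0 / (L * K)), maps
)
print(updated)  # [2.5 2.5 2.5]
```

With uniform weights the result is the mean of the level constants, (1 + 2 + 3 + 4) / 4 = 2.5, which confirms that every level contributes K samples to the sum.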

The second network of each encoder may further include a self-attention module. The operation of obtaining the updated facial a priori feature matrix of the current encoder may include obtaining a self-attention facial a priori feature matrix corresponding to the current encoder using the self-attention module, based on the facial a priori feature matrix corresponding to the current encoder; and obtaining the updated facial a priori feature matrix of the current encoder using the first deformable attention module, based on the self-attention facial a priori feature matrix corresponding to the current encoder and the fused image feature matrix of the current encoder.

The self-attention module may include a cascaded self-attention layer, layer normalization layer, and feedforward network layer. The operation of obtaining the self-attention facial a priori feature matrix of the current encoder may include inputting, to the self-attention layer, the facial a priori feature matrix corresponding to the current encoder with position information embedded as the query matrix, the facial a priori feature matrix corresponding to the current encoder with position information embedded as the key matrix, and the facial a priori feature matrix corresponding to the current encoder as the value matrix, and obtaining the self-attention facial a priori feature matrix of the current encoder through the cascaded self-attention layer, layer normalization layer, and feedforward network layer. The operation of obtaining the updated facial a priori feature matrix of the current encoder using the second deformable attention module may include determining, in the last-level feature map, a normalized position of each feature in the self-attention facial a priori feature matrix of the current encoder, where the normalized position indicates the normalized position, within the last-level feature map, of the feature-map element to which that feature corresponds; determining K normalized positions near the normalized position in the last-level feature map according to a preset rule; and determining K features corresponding to the K normalized positions in the updated image feature matrix of the current encoder and performing a weighted summation of the K features, thereby obtaining, for each feature of the self-attention facial a priori feature matrix, the corresponding feature of the updated facial a priori feature matrix of the current encoder.

An electronic device according to an embodiment includes a processor and a memory storing instructions, and the instructions, when executed by the processor, cause the electronic device to perform the method of any one of claims 1 to 9.

An electronic device according to an embodiment includes a first acquisition unit configured to obtain an initial image feature matrix of a face image based on a multi-level feature map of the face image; a second acquisition unit configured to obtain an initial facial a priori feature matrix of the face image based on the last-level feature map of the multi-level feature map; and a third acquisition unit configured to obtain a super-resolution image of the face image and/or key point coordinates of the face image, using one or more cascaded encoders, based on the initial image feature matrix and the initial facial a priori feature matrix.

FIG. 1 is a diagram illustrating a method of operating an electronic device according to an embodiment.
FIG. 2 is a diagram illustrating an operation of obtaining an initial image feature matrix and an initial facial a priori feature matrix according to an embodiment.
FIG. 3 is a diagram illustrating the structure of a first network layer of one of the one or more encoders according to an embodiment.
FIG. 4 is a diagram illustrating an operation of updating an image feature matrix based on a deformable attention mechanism according to an embodiment.
FIG. 5 is a diagram illustrating the structure of a second network layer according to an embodiment.
FIG. 6 is a diagram illustrating an operation of obtaining a super-resolution image of a face image and/or key point coordinates of the face image according to an embodiment.
FIG. 7 is a diagram illustrating an electronic device according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes changes, equivalents, or substitutes that fall within the technical idea described through the embodiments.

In this document, each of the phrases "A or B", "at least one of A and B", "at least one of A or B", "A, B, or C", "at least one of A, B, and C", "at least one of A, B, or C", and "one or a combination of two or more of A, B, and C" may include any one of the items listed together in the corresponding phrase, or all possible combinations thereof. Terms such as "first" and "second" may be used to describe various components, but these terms should be interpreted only as distinguishing one component from another. For example, a first component may be named a second component, and similarly, a second component may be named a first component.

When a component is described as being "connected" to another component, it should be understood that the component may be directly connected or coupled to the other component, or that intervening components may be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to indicate the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should not be understood as precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this specification.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description with reference to the drawings, identical components are given the same reference numerals regardless of the figure in which they appear, and redundant descriptions thereof are omitted.

FIG. 1 is a diagram illustrating a method of operating an electronic device according to an embodiment.

In the following embodiments, the operations may be performed sequentially, but are not necessarily performed sequentially. For example, the order of the operations may be changed, and at least two operations may be performed in parallel. Operations 101 to 103 may be performed by at least one component (e.g., a host processor, an accelerator, or a memory) of the electronic device.

The electronic device is a device that processes images and may include, for example, various computing devices such as a mobile phone, a smartphone, a tablet, an e-book device, a laptop, a personal computer, a desktop, a workstation, or a server; various wearable devices such as a smart watch, smart glasses, a head-mounted display (HMD), or smart clothing; various home appliances such as a smart speaker, a smart TV, or a smart refrigerator; a smart car, a smart kiosk, an Internet of Things (IoT) device, a walking assist device (WAD), a drone, or a robot, but is not limited to these examples. The image may include a face, but is not limited thereto, and may include various objects depending on the embodiment. For convenience of explanation, the electronic device may also be referred to herein as an image processing device or a face image processing device.

In operation 101, the electronic device obtains an initial image feature matrix of a face image based on a multi-level feature map of the face image. For example, the electronic device may obtain the initial image feature matrix of the face image by flattening and cascading the multi-level feature map of the face image.

In operation 102, the electronic device obtains an initial facial a priori feature matrix of the face image based on the last-level feature map of the multi-level feature map. For example, the electronic device may obtain the initial facial a priori feature matrix of the face image using a fully connected network, based on the last-level feature map of the multi-level feature map. Each level of the multi-level feature map may have the same number of channels.

Hereinafter, the operation of the electronic device is described in detail with reference to FIG. 2.

FIG. 2 is a diagram illustrating an operation of obtaining an initial image feature matrix and an initial facial a priori feature matrix according to an embodiment.

Referring to FIG. 2, the electronic device may extract four-level pyramid feature maps F1, F2, F3, and F4 of the input face image through a trained convolutional neural network (e.g., ResNet18). The electronic device may project each feature map through a 1*1 convolutional network so that all feature maps have the same number of channels, thereby obtaining feature maps F1', F2', F3', and F4'. The electronic device may obtain the initial image feature matrix of the face image by flattening and cascading the four projected feature maps F1', F2', F3', and F4'. The initial image feature matrix has M rows and C columns, and its element in the i-th row and m-th column denotes the corresponding feature. For example, C may be the common number of channels (e.g., C=256).
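The projection, flattening, and cascading steps can be sketched as follows. This is a minimal sketch under stated assumptions: feature maps are NumPy arrays in (channels, height, width) layout, the shapes of F1 to F4 are those a ResNet18 backbone would produce for a 128*128 input (not specified in the text), and random weights stand in for the trained 1*1 convolutions. The names `project_1x1` and `flatten_and_cascade` are hypothetical.

```python
import numpy as np

def project_1x1(fmap, weight):
    """1*1 convolution: (C_in, H, W) with weight (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, fmap)

def flatten_and_cascade(feature_maps):
    """Flatten each (C, H, W) map to (H*W, C) and concatenate the rows of all levels."""
    return np.concatenate(
        [fm.reshape(fm.shape[0], -1).T for fm in feature_maps], axis=0
    )

rng = np.random.default_rng(0)
C = 256
# Assumed shapes of F1..F4 as (channels, height, width).
raw_shapes = [(64, 32, 32), (128, 16, 16), (256, 8, 8), (512, 4, 4)]
raw_maps = [rng.standard_normal(shape) for shape in raw_shapes]
weights = [rng.standard_normal((C, shape[0])) * 0.01 for shape in raw_shapes]
projected = [project_1x1(f, w) for f, w in zip(raw_maps, weights)]  # F1'..F4'
initial_image_features = flatten_and_cascade(projected)
print(initial_image_features.shape)  # (1360, 256)
```

Under these assumed sizes, M = 32*32 + 16*16 + 8*8 + 4*4 = 1360, so each row of the matrix is the C-dimensional feature of one spatial position at one pyramid level.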

For convenience of explanation, a face image may also be referred to herein as a low-resolution (LR) image, and a map may also be referred to as a matrix. Also for convenience of explanation, the following description is based on an example in which four levels of feature maps are extracted from the LR image.

For example, the facial a priori feature matrix may be obtained from F4' through a linear projection over the spatial dimension, where N denotes the number of features in the facial a priori feature matrix.

For example, the initial facial a priori feature matrix of the LR image may be obtained according to Equation 1.

The operator in Equation 1 denotes a fully connected operation.
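A fully connected operation over the spatial dimension of F4' can be sketched as below. This is not a reconstruction of Equation 1; it is one plausible reading of the surrounding description, in which the H*W spatial positions of each channel are linearly mapped to N values, giving a matrix with N rows and C columns. The function name `initial_face_prior`, the bias term, and the value N=68 are assumptions for illustration only.

```python
import numpy as np

def initial_face_prior(f4_projected, weight, bias):
    """Fully connected projection over the spatial axis.

    f4_projected: (C, H, W); weight: (H*W, N); bias: (N,). Returns (N, C).
    """
    channels = f4_projected.shape[0]
    flat = f4_projected.reshape(channels, -1)  # (C, H*W)
    projected = flat @ weight + bias           # (C, N)
    return projected.T                         # (N, C)

rng = np.random.default_rng(0)
C, H, W, N = 256, 4, 4, 68  # N=68 is a hypothetical number of a priori features
f4_projected = rng.standard_normal((C, H, W))
weight = rng.standard_normal((H * W, N)) * 0.01
bias = np.zeros(N)
face_prior = initial_face_prior(f4_projected, weight, bias)
print(face_prior.shape)  # (68, 256)
```

Each of the N rows then shares the channel width C with the image feature matrix, which is what allows the two matrices to be combined by attention in the encoders.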

In operation 103, the electronic device obtains a super-resolution image of the face image and/or key point coordinates of the face image, using one or more cascaded encoders, based on the initial image feature matrix and the initial facial a priori feature matrix.

For example, the operation of obtaining the super-resolution image of the face image may include obtaining a fused image feature matrix based on the initial image feature matrix and the initial facial a priori feature matrix, using a cross-attention module included in the one or more encoders; obtaining an updated image feature matrix of the face image based on the fused image feature matrix, using a first deformable attention module included in the one or more encoders; and obtaining the super-resolution image of the face image based on the updated image feature matrix and the face image.

For convenience of explanation, a module may also be referred to herein as a layer. For example, the cross attention module may also be referred to as a cross attention layer, and the layer normalization layer may also be referred to as a normalization module.

For example, the operation of obtaining the key point coordinates of the face image may include obtaining a fused image feature matrix based on the initial image feature matrix and the initial facial a priori feature matrix, using a cross attention module included in the one or more encoders; obtaining updated facial a priori features of the face image based on the fused image feature matrix and the initial facial a priori feature matrix, using a second deformable attention module included in the one or more encoders; and predicting the key point coordinates of the face image based on the updated facial a priori features and initial key point coordinates of the face image. Here, the initial key point coordinates of the face image may be obtained based on the initial facial a priori feature matrix. For example, the electronic device may apply a fully connected operation to the initial facial a priori feature matrix to obtain the initial key point coordinates of the face image.

For example, each of the one or more encoders includes a first network, a second network, and a third network, where the first network includes a cross-attention module, the second network includes a first deformable attention module, and the third network includes a second deformable attention module. In this case, acquiring the super-resolution image of the face image and/or the key point coordinates of the face image may include, for each encoder: obtaining a fused image feature matrix of the current encoder using the first network, based on the image feature matrix and the facial a priori feature matrix corresponding to the current encoder; obtaining an updated facial a priori feature matrix of the current encoder using the second network, based on the fused image feature matrix of the current encoder and the facial a priori feature matrix corresponding to the current encoder; and obtaining an updated image feature matrix of the current encoder using the third network, based on the fused image feature matrix of the current encoder. The acquiring may further include obtaining the super-resolution image of the face image based on the updated image feature matrix of the last of the one or more encoders and the face image, and/or predicting the key point coordinates of the face image based on the updated facial a priori feature matrix of the last encoder and the initial key point coordinates of the face image. If the current encoder is the first of the one or more encoders, the image feature matrix corresponding to the current encoder is the initial image feature matrix, and the facial a priori feature matrix corresponding to the current encoder may be the initial facial a priori feature matrix. If the current encoder is not the first encoder, the image feature matrix corresponding to the current encoder is the updated image feature matrix of the preceding encoder, and the facial a priori feature matrix corresponding to the current encoder may be the updated facial a priori feature matrix of the preceding encoder.

For example, the one or more encoders may include a first encoder and a second encoder, but are not limited to this example.

For example, the electronic device may obtain the fused image feature matrix M12 of the first encoder using the first network of the first encoder, based on the initial image feature matrix M11 and the initial facial a priori feature matrix F11. The electronic device may obtain the updated facial a priori feature matrix F12 of the first encoder using the second network of the first encoder, based on the initial facial a priori feature matrix F11 and the fused image feature matrix M12 of the first encoder. The electronic device may obtain the updated image feature matrix M13 of the first encoder using the third network of the first encoder, based on the fused image feature matrix M12 of the first encoder. The electronic device may obtain the fused image feature matrix M22 of the second encoder using the first network of the second encoder, based on the updated facial a priori feature matrix F12 and the updated image feature matrix M13 of the first encoder. The electronic device may obtain the updated facial a priori feature matrix F22 of the second encoder using the second network of the second encoder, based on the fused image feature matrix M22 of the second encoder and the updated facial a priori feature matrix F12 of the first encoder. The electronic device may obtain the updated image feature matrix M23 of the second encoder using the third network of the second encoder, based on the fused image feature matrix M22 of the second encoder.
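The two-encoder data flow above (M11, F11 to M12, F12, M13, then M22, F22, M23) can be sketched as a simple loop. This is only an illustrative sketch: the names `run_cascade` and `make_toy_encoder` are hypothetical, and the three toy callables are stand-ins for the first, second, and third networks, not the networks described in this disclosure.

```python
import numpy as np

def run_cascade(encoders, M_init, F_init):
    """Chain encoders: each one consumes the previous encoder's updated matrices."""
    M, F = M_init, F_init
    for enc in encoders:
        M_fused = enc["first"](M, F)        # first network: fused image feature matrix
        F_next = enc["second"](M_fused, F)  # second network: updated facial a priori matrix
        M_next = enc["third"](M_fused)      # third network: updated image feature matrix
        M, F = M_next, F_next
    return M, F

def make_toy_encoder(scale):
    """Hypothetical stand-in networks, used only to exercise the data flow."""
    return {
        "first": lambda M, F: M + scale * F.mean(axis=0, keepdims=True),
        "second": lambda M_fused, F: F + scale * M_fused.mean(axis=0, keepdims=True),
        "third": lambda M_fused: M_fused * (1.0 + scale),
    }

M11 = np.ones((6, 4))   # initial image feature matrix (6 features, 4 channels)
F11 = np.zeros((3, 4))  # initial facial a priori feature matrix (3 features)
M23, F22 = run_cascade([make_toy_encoder(0.1), make_toy_encoder(0.1)], M11, F11)
print(M23.shape, F22.shape)
```

Note that the second network receives the facial a priori matrix that entered the encoder (F11 or F12), not one modified by the first network, which matches the order of operations in the text.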

For example, the electronic device may obtain the super-resolution image of the face image based on the updated image feature matrix M23 of the second encoder and the face image.

For example, the electronic device may predict the key point coordinates of the face image based on the updated facial a priori feature matrix F22 of the second encoder and the initial key point coordinates of the face image.

For example, acquiring the super-resolution image of the face image and/or the key point coordinates of the face image may include: obtaining a first offset using an upsampling amplification network, based on the updated image feature matrix of the last cascaded encoder, and obtaining the super-resolution image based on the first offset and the face image; and/or obtaining a second offset using a key point prediction network, based on the updated facial a priori feature matrix of the last cascaded encoder, and obtaining the predicted key point coordinates of the face image based on the second offset and the initial key point coordinates of the face image. Here, the initial key point coordinates of the face image may be obtained by applying a fully connected layer to the initial facial a priori feature matrix.

For example, the electronic device may obtain the first offset using the upsampling amplification network, based on the updated image feature matrix M23 of the second encoder, and obtain the super-resolution image based on the first offset and the face image. The electronic device may obtain the second offset using the key point prediction network, based on the updated facial a priori feature matrix F22 of the second encoder, and obtain the predicted key point coordinates of the face image based on the second offset and the initial key point coordinates of the face image. Here, the initial key point coordinates of the face image may be obtained by applying a fully connected layer to the initial facial a priori feature matrix.
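The offset-based outputs can be illustrated with a minimal sketch. The text does not state how an offset is combined with the face image or the initial coordinates, so the following assumes simple addition: the first offset is added to a nearest-neighbour upscaled copy of the input, and the second offset is added to the initial key point coordinates. The function names, the weights `W_fc` and `b_fc`, and the choice of one (x, y) pair per a priori feature are all hypothetical.

```python
import numpy as np

def initial_keypoints(F_init, W_fc, b_fc):
    """Fully connected layer mapping each a priori feature to an (x, y) pair (assumed shape)."""
    return F_init @ W_fc + b_fc  # (num_keypoints, 2)

def predict_keypoints(initial_xy, second_offset):
    """Assumption: the second offset is added to the initial coordinates."""
    return initial_xy + second_offset

def super_resolve(face_image, first_offset, scale):
    """Assumption: the first offset is a residual on a nearest-neighbour upscaled input."""
    upscaled = np.repeat(np.repeat(face_image, scale, axis=0), scale, axis=1)
    return upscaled + first_offset

rng = np.random.default_rng(0)
F11 = rng.normal(size=(5, 8))                       # 5 a priori features, 8 channels
W_fc, b_fc = rng.normal(size=(8, 2)), np.zeros(2)   # hypothetical FC weights
init_xy = initial_keypoints(F11, W_fc, b_fc)
final_xy = predict_keypoints(init_xy, np.full((5, 2), 0.01))

low_res = np.arange(4, dtype=float).reshape(2, 2)
sr = super_resolve(low_res, np.zeros((4, 4)), scale=2)
print(final_xy.shape, sr.shape)  # (5, 2) (4, 4)
```

Predicting offsets rather than absolute values means the networks only have to learn a correction to an existing estimate, which is one plausible reason for this design.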

For example, the first network of each encoder further includes a layer normalization (LN) layer and a feedforward network (FFN) layer. In this case, obtaining the fused image feature matrix of the current encoder includes inputting, to the cross-attention module, the image feature matrix corresponding to the current encoder with embedded position information as the query, the facial a priori feature matrix corresponding to the current encoder with embedded position information as the key, and the facial a priori feature matrix corresponding to the current encoder as the value, and obtaining the fused image feature matrix of the current encoder through the cascaded cross-attention module, layer normalization layer, and feedforward network layer.

FIG. 3 is a diagram illustrating the structure of the first network of one of the one or more encoders according to an embodiment.

Referring to FIG. 3, the first network may include a cross-attention module, a layer normalization layer, and a feedforward network layer.

For example, the image feature matrix Q corresponding to the current encoder, with embedded position information, may be obtained through Equation 2 below.

In Equation 2 above, the position term denotes, for each feature of the image feature matrix, the position of the corresponding feature within its original feature map, and the matrix term denotes the image feature matrix corresponding to each encoder without embedded position information.

For example, if a feature of the image feature matrix corresponds to a feature of the first level feature map, its position term may be obtained from that first level feature and may denote the position of that feature within the first level feature map.

For example, the key matrix K input to the cross-attention module may be obtained through Equation 3 below.

In Equation 3 above, the position term denotes, for each feature of the facial a priori feature matrix, the position of the corresponding feature within the last level feature map, and the matrix term denotes the facial a priori feature matrix corresponding to each encoder without embedded position information. For example, a given feature of the facial a priori feature matrix has a corresponding feature in the last level feature map, and its position term denotes the position of that feature within the last level feature map.

Additionally, the value matrix input to the cross-attention module may be the facial a priori feature matrix corresponding to the current encoder, without embedded position information. The output MHCA(Q, K, V) of the cross-attention module may be expressed as Equation 4 below.

In Equation 4 above, the scaling term may be the dimension of the row vectors of the key matrix.
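As an illustration of the query, key, and value roles described above, the following sketch uses the standard scaled dot-product form, which the reference to the row-vector dimension of the key matrix suggests but which is not reproduced in this text. It is single-head for brevity (the text names the output MHCA, a multi-head form), and it assumes position information is embedded by addition; `pos_M` and `pos_F` are hypothetical position embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(M, F, pos_M, pos_F):
    """Image features attend to facial a priori features (single head)."""
    Q = M + pos_M        # query: image features with position information (assumed additive)
    K = F + pos_F        # key: a priori features with position information (assumed additive)
    V = F                # value: a priori features without position information
    d_k = K.shape[-1]    # dimension of the row vectors of the key matrix
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 8))   # 6 image features
F = rng.normal(size=(3, 8))   # 3 facial a priori features
fused, weights = cross_attention(M, F, rng.normal(size=(6, 8)), rng.normal(size=(3, 8)))
print(fused.shape)  # (6, 8)
```

Each row of `weights` sums to one, so every output row is a convex combination of the a priori features, which is how a priori information enters the fused image features.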

Based on the output of the cross-attention module, the electronic device may obtain the fused image feature matrix of each encoder using the layer normalization layer and the feedforward network layer. According to an embodiment of the present disclosure, the electronic device obtains the fused image feature matrix by inputting the image feature matrix and the facial a priori feature matrix corresponding to each encoder into a multi-head attention layer. Because the fused image feature matrix obtained through this cross-attention mechanism integrates facial a priori feature information, it can better reflect the correlations between facial image features.

For convenience of explanation, the position within the original feature map of the feature that corresponds to a feature of the image feature matrix may be referred to simply as the position of that image feature matrix feature in the original feature map.

For example, obtaining the updated image feature matrix of the current encoder using the third network may include: determining a normalized position of each feature of the fused image feature matrix of the current encoder (here, the normalized position is the normalized position, within the corresponding feature map, of the feature map feature that corresponds to that feature); determining, in each feature map of the multi-level feature map, K normalized positions near the normalized position of each feature according to a preset rule; and performing a weighted summation over the L*K features of the fused image feature matrix of the current encoder that correspond to the K normalized positions in each feature map of the multi-level feature map, thereby obtaining, for each feature of the fused image feature matrix, the corresponding feature of the updated image feature matrix of the current encoder. Here, L is the number of feature maps in the multi-level feature map, for example L = 4.

For example, the third network may include a deformable attention layer, a residual summation and layer normalization layer (Add&Norm), and a feedforward network layer (FFN).

An example of obtaining the features of an encoder's updated image feature matrix using the third network is described below.

For example, for each feature of the fused image feature matrix output by the first network, the layer information and position information corresponding to that feature may be embedded into and added to the feature according to Equation 5 below.

In Equation 5 above, the terms denote, respectively: the i-th feature of the fused image feature matrix; the original feature map corresponding to the i-th feature (that is, one level of the multi-level feature map); the position, within that original feature map, of the feature corresponding to the i-th feature; and the i-th feature after the layer information and position information have been added.

For example, normalized coordinates may be used to express the spatial position of each feature within its original feature map; that is, the normalized spatial position of the i-th feature of the fused image feature matrix within its corresponding original feature map. For example, (0, 0) and (1, 1) may denote the normalized spatial positions of the upper-left and lower-right features of the original feature map, respectively. These normalized coordinates may be used as reference points for sampling related features.
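The normalized coordinate convention can be illustrated with a small helper. The text fixes only the two corners, (0, 0) for the upper-left feature and (1, 1) for the lower-right feature; the linear mapping in between is an assumption, and `normalized_position` is a hypothetical name.

```python
def normalized_position(row, col, height, width):
    """Map a feature's (row, col) index to normalized (x, y) in [0, 1].

    Assumes a linear mapping with the first and last features pinned to 0 and 1.
    """
    x = col / (width - 1) if width > 1 else 0.0
    y = row / (height - 1) if height > 1 else 0.0
    return (x, y)

print(normalized_position(0, 0, 8, 8))  # (0.0, 0.0), upper-left feature
print(normalized_position(7, 7, 8, 8))  # (1.0, 1.0), lower-right feature
```

Because the coordinates are resolution independent, the same reference point can be reused on feature maps of different sizes, which is what the multi-level sampling described later relies on.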

For example, for a given feature of the fused image feature matrix with known normalized coordinates in its corresponding original feature map, the electronic device may sample a plurality of features around those coordinates and use them to update the feature. The normalized coordinates of the sampled features may be expressed as Equation 6 below.

In Equation 6 above, the result denotes the normalized coordinates, in the original feature map, of a sampled feature, where k = 1, ..., K, and K may be a preset value.

After determining the normalized coordinates of the sampled features, the electronic device may determine the corresponding features of the fused image feature matrix based on those coordinates, and may determine the updated feature using Equation 7 below.

In Equation 7 above, two of the terms are learnable weight matrices, and the remaining term may be obtained from Equation 8 or Equation 9 below.

For example, for a feature of the fused image feature matrix that corresponds to a feature of the second level feature map of the pyramid feature map, the electronic device may determine the normalized position of that feature in the second level feature map, and may determine K coordinates near that position based on Equation 6 above. Here, the K coordinates correspond to K coordinates in the second level feature map, and in turn to K features of the fused image feature matrix. In other words, the electronic device may compute the updated feature from the K features of the fused image feature matrix that correspond to those K coordinates, using Equation 7 above.

As described above, when computing the updated feature, the electronic device may sample K features from the fused image feature matrix based only on the original feature map corresponding to that feature (for example, the second level feature map).

In another embodiment, to integrate multi-level image features, the electronic device may sample K features from each level feature map for a given feature, and may obtain the updated feature through Equation 10 below.

In Equation 10 above, L is the number of levels in the multi-level feature map (for example, L = 4 when the extracted pyramid feature map has four levels), and the sampled term denotes the k-th feature sampled from the fused image feature matrix based on the j-th level feature map. Each sampled feature has corresponding position coordinates, one term of which may be obtained through a linear projection as in Equation 11 below.

For example, for a given feature of the fused image feature matrix, the electronic device may determine K coordinates in each level feature map through Equation 11 above, and may determine the K features of the fused image feature matrix that correspond to the K coordinates in each level feature map. In this case, the electronic device determines K*L features of the fused image feature matrix, and may determine the updated feature based on Equation 10.
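The multi-level update (a weighted sum over L*K sampled features) can be sketched as follows. Since the equations are not reproduced here, this is a simplified sketch under stated assumptions: the fused features are viewed as one (H, W, C) array per level, sampling positions are a reference point plus externally supplied offsets (in the text they come from a linear projection), features are read with bilinear interpolation, and the learnable weight matrices are omitted. The names `bilinear_sample` and `multi_level_update` are hypothetical.

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Read an (H, W, C) map at normalized (x, y); (0, 0) is upper-left, (1, 1) lower-right."""
    H, W, _ = feature_map.shape
    px = np.clip(x, 0.0, 1.0) * (W - 1)
    py = np.clip(y, 0.0, 1.0) * (H - 1)
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = px - x0, py - y0
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

def multi_level_update(levels, ref_xy, offsets, weights):
    """Weighted sum of L*K features sampled near ref_xy on every level.

    levels:  list of L arrays shaped (H_j, W_j, C)
    offsets: (L, K, 2) displacements from the reference point (assumed given)
    weights: (L, K) attention weights, assumed to sum to 1 over all L*K samples
    """
    out = np.zeros(levels[0].shape[-1])
    for j, fmap in enumerate(levels):
        for k in range(offsets.shape[1]):
            x = ref_xy[0] + offsets[j, k, 0]
            y = ref_xy[1] + offsets[j, k, 1]
            out += weights[j, k] * bilinear_sample(fmap, x, y)
    return out

# Four levels (L = 4) of decreasing resolution; level j is filled with the constant j + 1.
levels = [np.full((8 // 2**j, 8 // 2**j, 3), float(j + 1)) for j in range(4)]
offsets = np.zeros((4, 2, 2))
offsets[:, 1, :] = 0.1                 # K = 2 sampling points per level
weights = np.full((4, 2), 1.0 / 8.0)   # uniform weights over the 8 samples
updated = multi_level_update(levels, (0.5, 0.5), offsets, weights)
print(updated)  # [2.5 2.5 2.5]
```

With constant maps and uniform weights the result is the mean of the level values, which makes the sketch easy to check by hand; the single-level variant described earlier corresponds to passing a single entry in `levels`.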

FIG. 4 is a diagram illustrating an operation of updating an image feature matrix based on a deformable attention mechanism according to an embodiment.

FIG. 4 shows an example of obtaining an updated image feature matrix from the fused image feature matrix using a deformable attention mechanism.

Referring to FIG. 4, the electronic device may update the features of the fused image feature matrix that correspond to each level feature map by using the features of the fused image feature matrix that correspond to the multi-level feature map, thereby obtaining the features of the updated image feature matrix that correspond to each feature map. FIG. 4 also shows an example of the structure of the third network.

For example, for a given feature of the fused image feature matrix, the electronic device may obtain K features from the features corresponding to the first level feature map, K features from the features corresponding to the second level feature map, K features from the features corresponding to the third level feature map, and K features from the features corresponding to the fourth level feature map, and may determine, based on the 4K features obtained, the updated feature that corresponds to the given feature in the updated image feature matrix. Thus, a feature of the updated image feature matrix that corresponds to, for example, the first level feature map may include information from every level feature map.

Each feature of the updated image feature matrix integrates information from the features of the fused image feature matrix that correspond to the multi-level feature map. Information corresponding to the lower-level feature maps is thereby taken into account more fully, so the correlations between local features of the face image can be better reflected.

The method described in this specification for sampling K features based on each feature map is only one example; collecting K or L*K features by another method to update the features of the second image matrix may also be applied without limitation.

For example, the fused image feature matrix may be updated into the updated image feature matrix by sampling features from only some of the feature maps.

For example, the second network of each encoder further includes a self-attention module. In this case, obtaining the updated facial a priori feature matrix of the current encoder may include: obtaining a self-attention facial a priori feature matrix corresponding to the current encoder using the self-attention module, based on the facial a priori feature matrix corresponding to the current encoder; and obtaining the updated facial a priori feature matrix of the current encoder using the first deformable attention module, based on the self-attention facial a priori feature matrix corresponding to the current encoder and the fused image feature matrix of the current encoder.

For example, the self-attention module includes a cascaded self-attention layer, layer normalization layer, and feedforward network layer. In this case, obtaining the self-attention facial a priori feature matrix of the current encoder may include inputting, to the self-attention layer, the facial a priori feature matrix corresponding to the current encoder with embedded position information as the query matrix, the same matrix with embedded position information as the key matrix, and the facial a priori feature matrix corresponding to the current encoder as the value matrix, and obtaining the self-attention facial a priori feature matrix of the current encoder through the cascaded self-attention layer, layer normalization layer, and feedforward network layer. Here, obtaining the updated facial a priori feature matrix of the current encoder using the second deformable attention module may include: determining, in the last level feature map, a normalized position of each feature of the self-attention facial a priori feature matrix of the current encoder (here, the normalized position is the normalized position, within the last level feature map, of the last level feature that corresponds to that feature); determining K normalized positions near that normalized position in the last level feature map according to a preset rule; and determining the K features of the updated image feature matrix of the current encoder that correspond to the K normalized positions, and performing a weighted summation over the K features to obtain, for each feature of the self-attention facial a priori feature matrix, the corresponding feature of the updated facial a priori feature matrix of the current encoder.

FIG. 5 is a diagram illustrating the structure of the second network according to an embodiment.

Referring to FIG. 5, the electronic device may first obtain the self-attention facial a priori feature matrix using the self-attention module. Here, the self-attention module may include a self-attention layer, a normalization layer (Add&Norm), and a feedforward network layer (FFN). The query matrix, key matrix, and value matrix input to the self-attention layer may be, respectively, the facial a priori feature matrix corresponding to each encoder with embedded position information, the same matrix with embedded position information, and the facial a priori feature matrix corresponding to each encoder. By updating the facial a priori feature matrix through the self-attention mechanism, the electronic device can learn the dependencies between facial a priori features. The self-attention facial a priori features obtained in this way can therefore reflect the structural information of the facial a priori features and represent the input image better.

The output of the self-attention layer of the self-attention module may be expressed as Equation 12 below.

In Equation 12 above, Q, K, and V may denote the query matrix, key matrix, and value matrix, respectively, that are input to the self-attention layer.

Based on the output of the self-attention layer, the electronic device may obtain the self-attention facial a priori feature matrix using the layer normalization layer (Add&Norm) and the feedforward network layer (FFN).
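The cascade of self-attention layer, Add&Norm, and FFN can be sketched as below. Several details are assumptions because the text does not specify them: position information is added to the query and key only, the residual in Add&Norm is taken from the input matrix, layer normalization has no learned scale or shift, and the FFN is two linear layers with a ReLU in between. All names and weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention_block(F, pos, W1, b1, W2, b2):
    """Self-attention layer, then Add&Norm, then FFN, in cascade."""
    Q = K = F + pos                                   # query and key carry position information
    V = F                                             # value does not
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    h = layer_norm(F + attn)                          # Add&Norm (residual from F is an assumption)
    return np.maximum(h @ W1 + b1, 0.0) @ W2 + b2     # FFN: linear, ReLU, linear (assumed)

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 8))     # 5 facial a priori features, 8 channels
pos = rng.normal(size=(5, 8))   # hypothetical position embedding
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
F_self = self_attention_block(F, pos, W1, b1, W2, b2)
print(F_self.shape)  # (5, 8)
```

Because every a priori feature attends to every other one, the output rows can encode relationships between facial landmarks, which matches the stated aim of learning dependencies between facial a priori features.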

The electronic device may then obtain the updated facial a priori feature matrix from the self-attention facial a priori feature matrix using a deformable attention mechanism.

Referring to FIG. 2, the deformable attention module may include a deformable attention layer, a layer normalization layer (LN), and a feedforward network layer (FFN).

An example of the computation of the updated facial a priori feature matrix is described below.

For example, for a feature of the self-attention facial a priori feature matrix, the electronic device may first determine the normalized position, within the corresponding original feature map (that is, the last level feature map), of the feature to which it corresponds. For example, (0, 0) and (1, 1) may denote the spatial positions of the upper-left and lower-right features of the last level feature map, respectively.

그런 다음, 전자 장치는 아래의 수학식 13에 따라 　주변의 K개의 위치를 결정한다.Then, the electronic device follows equation 13 below: Determine K locations in the surrounding area.

In Equation 13 above, […] is […], and […] may denote the position-embedded facial a priori feature corresponding to […].

The electronic device may determine the features in the fused image feature matrix that correspond to the K positions. For example, the fused image feature matrix contains a feature for each normalized position of the last-level feature map, so K features of the fused image feature matrix can be determined from the K normalized positions.

Then, according to Equation 14 below, each feature of the self-attention facial a priori feature matrix may be updated, based on the K features, to its corresponding feature in the updated facial a priori feature matrix.

In Equation 14 above, W is a learnable weight matrix, […], and […] may be […] or […].
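
The sampling-and-aggregation step described above can be sketched as follows. Equations 13 and 14 are not reproduced in this text, so the sketch assumes a common deformable-attention pattern: K offsets and K attention weights are predicted linearly from the position-embedded query, features are read at the offset positions by bilinear interpolation, and the weighted sum is projected by W. The function names and the parameters `w_off`, `w_attn`, and `w_out` are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """feat: (H, W, d). x, y are normalized; (0,0) is top-left, (1,1) is bottom-right."""
    h, w, _ = feat.shape
    px = np.clip(x, 0.0, 1.0) * (w - 1)
    py = np.clip(y, 0.0, 1.0) * (h - 1)
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = px - x0, py - y0
    top = (1 - ax) * feat[y0, x0] + ax * feat[y0, x1]
    bot = (1 - ax) * feat[y1, x0] + ax * feat[y1, x1]
    return (1 - ay) * top + ay * bot

def deformable_update(query, ref_xy, feat, w_off, w_attn, w_out):
    """query: (d,) position-embedded prior feature; ref_xy: (2,) normalized position;
    feat: (H, W, d) fused image features laid out on the last-level grid."""
    k = w_attn.shape[1]
    offsets = (query @ w_off).reshape(k, 2)      # K predicted (dx, dy) offsets (assumed linear)
    logits = query @ w_attn
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                     # K attention weights summing to 1
    sampled = np.stack([bilinear_sample(feat, *(ref_xy + o)) for o in offsets])
    return (weights[:, None] * sampled).sum(axis=0) @ w_out   # weighted sum, then W
```

Because the weights are normalized, sampling a constant feature map returns that constant, which is a quick sanity check on the aggregation.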

FIG. 6 is a diagram illustrating an operation of obtaining a super-resolution image of a face image and/or key point coordinates of the face image, according to an embodiment.

Referring to FIG. 6, the T cascaded encoders may yield the updated image feature matrix and/or the updated facial a priori feature matrix of the last encoder. The electronic device may obtain a first offset through an upsampling network to obtain the super-resolution image of the face image, and/or obtain a second offset through a key point prediction network to obtain the predicted key point coordinates of the face image.

For example, the electronic device may extract, from the updated image feature matrix of the last encoder, the features corresponding to the first-level feature map, and upsample them through a convolution layer and a pixel shuffle layer to obtain the offset between the super-resolution image and the low-resolution (LR) image (that is, the input image). The electronic device may then obtain the super-resolution image based on Equation 15 below.
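
The offset-based reconstruction can be sketched as follows. Equation 15 is not reproduced in this text, so the sketch assumes the super-resolution image is an upsampled copy of the LR image plus the predicted offset. Nearest-neighbor upsampling stands in for whatever interpolation the disclosure uses, and the convolution layer is assumed to have already produced `conv_features`. All function names are hypothetical.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r), as a pixel shuffle layer does."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)       # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

def upsample_nearest(img, r):
    # Stand-in interpolation (assumption): repeat each pixel r times per axis.
    return img.repeat(r, axis=1).repeat(r, axis=2)

def super_resolve(lr_img, conv_features, r):
    """lr_img: (C, H, W); conv_features: (C*r*r, H, W) output of the convolution layer."""
    offset = pixel_shuffle(conv_features, r)     # offset between SR and LR images
    return upsample_nearest(lr_img, r) + offset  # assumed form of Equation 15
```

Predicting an offset rather than the full image means the network only has to learn the high-frequency detail that the upsampled input lacks.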

For example, based on Equation 16 below, the electronic device may use a multi-layer perceptron (MLP) to obtain the offset between the predicted key point coordinates of the face image and the initial key point coordinates of the face image.

In Equation 16 above, […] may denote the facial a priori feature matrix of the last encoder.

For example, the electronic device may obtain the offset using a three-layer fully connected network with rectified linear unit (ReLU) activations. The first two layers each consist of a linear fully connected layer followed by a ReLU activation, and the last layer is a fully connected layer that outputs the offset directly.

After the offset is obtained, the predicted key point coordinates of the face image may be obtained by Equation 17 below.

In Equation 17 above, […] denotes the […] function, and […] may denote the initial key point coordinates of the face image.
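
The key point prediction step can be sketched as follows. The three-layer MLP follows the description in the text. The function in Equation 17 is not named in this text, so the sketch assumes a sigmoid applied to the offset plus the inverse sigmoid of the initial coordinates, a common way to refine normalized reference points. All names and parameter shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inverse_sigmoid(p, eps=1e-5):
    # Clip to avoid log(0) at the boundaries of [0, 1].
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def predict_keypoints(prior, init_xy, w1, b1, w2, b2, w3, b3):
    """prior: (N, d) facial a priori features of the last encoder;
    init_xy: (N, 2) initial key point coordinates, normalized to [0, 1]."""
    h = np.maximum(prior @ w1 + b1, 0.0)   # fully connected layer + ReLU
    h = np.maximum(h @ w2 + b2, 0.0)       # fully connected layer + ReLU
    offset = h @ w3 + b3                   # last layer outputs the offset directly
    # Assumed form of Equation 17: refine the initial coordinates in logit space.
    return sigmoid(offset + inverse_sigmoid(init_xy))
```

With a zero offset the prediction reduces to the initial coordinates, so the MLP only has to learn a correction.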

For example, the entire model used to obtain the super-resolution image and/or the predicted key point coordinates of the face image may be trained jointly with a loss function for FSR and a loss function for face marking.

For example, the FSR loss may include a pixel loss, an adversarial loss, and a perceptual loss, and the face marking loss may include a consistency loss and a separation constraint.
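
Joint training with the loss terms listed above can be sketched as a weighted sum. The text does not state how the terms are combined or weighted, so the linear combination, the weights, and the loss values below are purely illustrative assumptions.

```python
def combine_losses(terms, weights):
    """Weighted sum of named scalar loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

# Illustrative scalar loss values (hypothetical, not measured results).
fsr_terms = {"pixel": 2.0, "adversarial": 4.0, "perceptual": 1.0}
marking_terms = {"consistency": 3.0, "separation": 0.5}

# Hypothetical weights; the disclosure does not specify them.
weights = {"pixel": 1.0, "adversarial": 0.5, "perceptual": 1.0,
           "consistency": 1.0, "separation": 2.0}

total = combine_losses({**fsr_terms, **marking_terms}, weights)
```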

FSR restores a low-resolution face image to a high-resolution face image, and the FSR network may be trained so that, given a low-resolution training image as input, it produces the corresponding high-resolution ground-truth image. In this way, FSR performance can be improved effectively while reducing the complexity of the FSR network, without additional facial a priori information.

A method of processing a face image in an electronic device according to embodiments of the present disclosure has been described above with reference to FIGS. 1 to 6. An electronic device that processes a face image according to an embodiment of the present disclosure is described below with reference to FIG. 7.

FIG. 7 is a diagram illustrating an electronic device according to an embodiment.

Referring to FIG. 7, an electronic device 700 that processes a face image may include a first acquisition unit 701, a second acquisition unit 702, and a third acquisition unit 703. However, the electronic device 700 is not limited to this example; it may include additional components, and one or more of its components may be divided or combined.

For example, the first acquisition unit 701 may obtain the initial image feature matrix of the face image based on the multi-level feature map of the face image.

For example, the second acquisition unit 702 may obtain the initial facial a priori feature matrix of the face image based on the last-level feature map of the multi-level feature map.

For example, the third acquisition unit 703 may obtain the super-resolution image of the face image and/or the key point coordinates of the face image using one or more cascaded encoders, based on the initial image feature matrix and the initial facial a priori feature matrix.

For example, the third acquisition unit 703 may obtain a fused image feature matrix based on the initial image feature matrix and the initial facial a priori feature matrix using a cross attention module included in the one or more encoders, obtain an updated image feature matrix of the face image based on the fused image feature matrix using a first deformable attention module included in the one or more encoders, and obtain the super-resolution image of the face image based on the updated image feature matrix and the face image.

For example, the third acquisition unit 703 may obtain a fused image feature matrix based on the initial image feature matrix and the initial facial a priori feature matrix using a cross attention module included in the one or more encoders, obtain updated facial a priori features of the face image based on the fused image feature matrix and the initial facial a priori feature matrix using a second deformable attention module included in the one or more encoders, and predict the key point coordinates of the face image based on the updated facial a priori features and the initial key point coordinates of the face image. Here, the initial key point coordinates of the face image may be obtained based on the initial facial a priori feature matrix.

For example, each of the one or more encoders may include a first network, a second network, and a third network, where the first network includes the cross attention module, the second network includes the first deformable attention module, and the third network includes the second deformable attention module.

For example, for each encoder, the third acquisition unit 703 may obtain the fused image feature matrix of the current encoder using the first network, based on the image feature matrix and the facial a priori feature matrix corresponding to the current encoder; obtain the updated facial a priori feature matrix of the current encoder using the second network, based on the fused image feature matrix of the current encoder and the facial a priori feature matrix corresponding to the current encoder; and obtain the updated image feature matrix of the current encoder using the third network, based on the fused image feature matrix of the current encoder. The third acquisition unit 703 may obtain the super-resolution image of the face image based on the face image and the updated image feature matrix of the last of the one or more encoders, and/or predict the key point coordinates of the face image based on the updated facial a priori feature matrix of the last encoder and the initial key point coordinates of the face image. When the current encoder is the first of the one or more encoders, the image feature matrix corresponding to the current encoder is the initial image feature matrix, and the facial a priori feature matrix corresponding to the current encoder may be the initial facial a priori feature matrix. When the current encoder is not the first encoder, the image feature matrix corresponding to the current encoder is the updated image feature matrix of the preceding encoder, and the facial a priori feature matrix corresponding to the current encoder may be the updated facial a priori feature matrix of the preceding encoder.
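
The data flow through the cascaded encoders can be sketched as a simple loop. The three networks are passed in as plain callables, and the dictionary keys `"first"`, `"second"`, and `"third"` as well as the function name `run_cascade` are hypothetical. The sketch shows only how each encoder's outputs become the next encoder's inputs, not the networks themselves.

```python
def run_cascade(image_feat, prior_feat, encoders):
    """Feed the initial matrices through each encoder in turn.

    Each encoder is a dict of three callables (hypothetical layout):
      "first"(image_feat, prior_feat) -> fused image features
      "second"(fused, prior_feat)     -> updated facial a priori features
      "third"(fused)                  -> updated image features
    """
    for enc in encoders:
        fused = enc["first"](image_feat, prior_feat)
        new_prior = enc["second"](fused, prior_feat)
        new_image = enc["third"](fused)
        # The updated matrices become the inputs of the next encoder.
        image_feat, prior_feat = new_image, new_prior
    return image_feat, prior_feat

# Toy stand-ins on scalars, only to show the wiring.
toy = {"first": lambda i, p: i + p, "second": lambda f, p: f - p, "third": lambda f: 2 * f}
final_image, final_prior = run_cascade(1, 2, [toy, toy])
```

The outputs of the last iteration are the matrices passed on to the upsampling network and the key point prediction network.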

For example, the third acquisition unit 703 may obtain a first offset using an upsampling network based on the updated image feature matrix of the last cascaded encoder, and obtain the super-resolution image based on the first offset and the face image; and/or obtain a second offset using a key point prediction network based on the updated facial a priori feature matrix of the last cascaded encoder, and obtain the predicted key point coordinates of the face image based on the second offset and the initial key point coordinates of the face image. Here, the initial key point coordinates of the face image may be obtained by applying a fully connected layer to the initial facial a priori feature matrix.

For example, the first network of each encoder may further include a layer normalization layer and a feedforward network layer.

For example, the third acquisition unit 703 may input the position-embedded image feature matrix corresponding to the current encoder, the position-embedded facial a priori feature matrix corresponding to the current encoder, and the facial a priori feature matrix corresponding to the current encoder into the cross attention module as the query vector, key vector, and value vector, respectively, and obtain the fused image feature matrix of the current encoder through the cascaded cross attention module, layer normalization layer, and feedforward network layer.

For example, the third acquisition unit 703 may determine the normalized position of each feature of the fused image feature matrix of the current encoder, where the normalized position is the normalized position, within the corresponding feature map, of the feature of that map to which the given feature corresponds; determine, in each feature map of the multi-level feature map and according to a preset rule, K normalized positions near the normalized position of each feature; and compute a weighted sum over the L*K features of the fused image feature matrix of the current encoder that correspond to the K normalized positions in each feature map of the multi-level feature map, thereby obtaining, for each feature of the fused image feature matrix, the corresponding feature of the updated image feature matrix of the current encoder. Here, L may be the number of feature maps in the multi-level feature map.
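
The weighted sum over L*K sampled features can be written compactly as below. This is a sketch in the style of multi-scale deformable attention, not the disclosed equation. All symbols are introduced here for illustration: x_l is the part of the fused image feature matrix belonging to level l, p_q^l is the normalized position of feature q at level l, Δp_{lqk} is the k-th sampling offset, A_{lqk} is the weight assigned to that sample, and W is a learnable projection.

```latex
\hat{f}_q \;=\; \sum_{l=1}^{L} \sum_{k=1}^{K} A_{lqk}\, W\, x_l\!\left(p_q^{\,l} + \Delta p_{lqk}\right),
\qquad \sum_{l=1}^{L} \sum_{k=1}^{K} A_{lqk} = 1
```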

For example, the second network of each encoder may further include a self-attention module. In this case, the third acquisition unit 703 may obtain the self-attention facial a priori feature matrix corresponding to the current encoder using the self-attention module, based on the facial a priori feature matrix corresponding to the current encoder, and obtain the updated facial a priori feature matrix of the current encoder using the first deformable attention module, based on the self-attention facial a priori feature matrix corresponding to the current encoder and the fused image feature matrix of the current encoder.

For example, the self-attention module may include a cascaded self-attention layer, layer normalization layer, and feedforward network layer. For example, the third acquisition unit 703 may input the position-embedded facial a priori feature matrix corresponding to the current encoder, the same position-embedded facial a priori feature matrix, and the facial a priori feature matrix corresponding to the current encoder into the self-attention layer as the query matrix, key matrix, and value matrix, respectively, and obtain the self-attention facial a priori feature matrix of the current encoder through the cascaded self-attention layer, layer normalization layer, and feedforward network layer; determine, in the last-level feature map, the normalized position of each feature of the self-attention facial a priori feature matrix of the current encoder, where the normalized position is the normalized position, within the last-level feature map, of the feature of that map to which the given feature corresponds; determine, according to a preset rule, K normalized positions near that normalized position in the last-level feature map; determine the K features corresponding to the K normalized positions in the updated image feature matrix of the current encoder; and compute a weighted sum of the K features to obtain, for each feature of the self-attention facial a priori feature matrix, the corresponding feature of the updated facial a priori feature matrix of the current encoder.

According to an embodiment of the present disclosure, an electronic device is provided that includes a processor and a memory storing a computer program, where the computer program, when executed by the processor, implements the face image processing method.

According to an embodiment, the electronic device 700 may include a memory (not shown) and a processor (not shown).

The memory may contain computer-readable instructions. The processor may perform the operations described above as the instructions stored in the memory are executed by the processor. The memory may be volatile memory or non-volatile memory.

The processor is a device that executes instructions or programs, or controls the electronic device 700, and may include, for example, a central processing unit (CPU), a graphics processing unit (GPU), or a neural processing unit (NPU), but is not limited to these examples.

In addition, the electronic device 700 may process the operations described above.

The embodiments described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose or special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and software applications that execute on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but those of ordinary skill in the art will appreciate that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include multiple processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be stored in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may store program instructions, data files, data structures, and the like, alone or in combination, and the program instructions recorded on the medium may be specially designed and constructed for the embodiment or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that a computer can execute using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, those of ordinary skill in the art may apply various technical modifications and variations based on them. For example, appropriate results may be achieved even if the described techniques are performed in a different order than described, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a different form than described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims set out below.

Claims

A method of operating an electronic device, the method comprising:
obtaining an initial image feature matrix of a face image based on a multi-level feature map of the face image;
obtaining an initial facial a priori feature matrix of the face image based on a last-level feature map of the multi-level feature map; and
obtaining a super-resolution image of the face image and/or key point coordinates of the face image using one or more cascaded encoders, based on the initial image feature matrix and the initial facial a priori feature matrix.

The method of claim 1, wherein
the obtaining of the super-resolution image of the face image comprises:
obtaining a fused image feature matrix based on the initial image feature matrix and the initial facial a priori feature matrix using a cross attention module included in the one or more encoders;
obtaining an updated image feature matrix of the face image based on the fused image feature matrix using a first deformable attention module included in the one or more encoders; and
obtaining the super-resolution image of the face image based on the updated image feature matrix and the face image.

The method of claim 2, wherein
the obtaining of the key point coordinates of the face image comprises:
obtaining a fused image feature matrix based on the initial image feature matrix and the initial facial a priori feature matrix using a cross attention module included in the one or more encoders;
obtaining updated facial a priori features of the face image based on the fused image feature matrix and the initial facial a priori feature matrix using a second deformable attention module included in the one or more encoders; and
predicting the key point coordinates of the face image based on the updated facial a priori features and initial key point coordinates of the face image, the initial key point coordinates of the face image being obtained based on the initial facial a priori feature matrix.

The method of claim 3, wherein
each of the one or more encoders includes a first network, a second network, and a third network, the first network including the cross attention module, the second network including the first deformable attention module, and the third network including the second deformable attention module,
the obtaining of the super-resolution image of the face image and/or the key point coordinates of the face image comprises:
for each encoder, obtaining a fused image feature matrix of a current encoder using the first network, based on an image feature matrix and a facial a priori feature matrix corresponding to the current encoder;
obtaining an updated facial a priori feature matrix of the current encoder using the second network, based on the fused image feature matrix of the current encoder and the facial a priori feature matrix corresponding to the current encoder;
obtaining an updated image feature matrix of the current encoder using the third network, based on the fused image feature matrix of the current encoder; and
obtaining the super-resolution image of the face image based on the face image and the updated image feature matrix of a last encoder of the one or more encoders, and/or predicting the key point coordinates of the face image based on the updated facial a priori feature matrix of the last encoder and the initial key point coordinates of the face image,
when the current encoder is a first encoder of the one or more encoders, the image feature matrix corresponding to the current encoder is the initial image feature matrix, and the facial a priori feature matrix corresponding to the current encoder is the initial facial a priori feature matrix, and
when the current encoder is not the first encoder, the image feature matrix corresponding to the current encoder is the updated image feature matrix of the encoder preceding the current encoder, and the facial a priori feature matrix corresponding to the current encoder is the updated facial a priori feature matrix of the encoder preceding the current encoder.

According to claim 4,
The operation of acquiring a super-resolution image of the face image and/or key point coordinates of the face image,
Obtaining a first offset using an upsampling amplification network based on the updated image feature matrix corresponding to the last cascaded encoder, and obtaining the super-resolution image based on the first offset and the face image; and/or
Obtaining a second offset using a key point prediction network based on the updated facial a priori feature matrix corresponding to the last cascaded encoder, and obtaining the predicted key point coordinates of the face image based on the second offset and the initial key point coordinates of the face image
Including,
A method of operating an electronic device.
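
One way to read the two offsets above is as residual corrections. The following Python sketch rests on assumptions that the claim does not state: the face image is first enlarged by nearest-neighbor upsampling and the first offset is added pixel by pixel, and the second offset is added to the initial key point coordinates. All function names are hypothetical, and the offsets are passed in directly rather than predicted by a network.

```python
from typing import List, Tuple

Image = List[List[float]]
Point = Tuple[float, float]


def upsample_nearest(image: Image, scale: int) -> Image:
    """Repeat every pixel `scale` times along both axes (assumed upsampling)."""
    return [[px for px in row for _ in range(scale)]
            for row in image for _ in range(scale)]


def super_resolve(face_image: Image, first_offset: Image, scale: int) -> Image:
    """Assumed combination: upsampled input plus the predicted first offset."""
    base = upsample_nearest(face_image, scale)
    return [[b + o for b, o in zip(base_row, offset_row)]
            for base_row, offset_row in zip(base, first_offset)]


def refine_key_points(initial: List[Point], second_offset: List[Point]) -> List[Point]:
    """Assumed combination: initial coordinates plus the predicted second offset."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(initial, second_offset)]
```

Predicting offsets instead of absolute values lets each network output only the correction relative to an existing estimate, which is a common design choice in residual models.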

According to claim 4,
The first network of each encoder further includes a layer normalization layer and a feedforward network layer,
The operation of acquiring the fused image feature matrix of the current encoder is
Inputting the image feature matrix corresponding to the current encoder with embedded position information, the facial a priori feature matrix corresponding to the current encoder with embedded position information, and the facial a priori feature matrix corresponding to the current encoder into the cross attention module as the query vector, key vector, and value vector, respectively, and obtaining the fused image feature matrix of the current encoder through the cascaded cross attention module, layer normalization layer, and feedforward network layer
Including,
A method of operating an electronic device.
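
The query, key, and value assignment above can be sketched in plain Python as single-head scaled dot-product attention. This is a simplified illustration under stated assumptions: learned projection matrices, multiple heads, and residual connections are omitted, position information is embedded by simple addition, and the feedforward network is passed in as a hypothetical callable `ffn`.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum, used here to embed position information (assumption)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each row to zero mean and unit variance (no learned scale or shift)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def cross_attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """Each query row attends over all key rows and mixes the value rows."""
    scale = math.sqrt(len(query[0]))
    result = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in key]
        weights = softmax(scores)
        result.append([sum(w * v[d] for w, v in zip(weights, value))
                       for d in range(len(value[0]))])
    return result


def fuse(image: Matrix, prior: Matrix, image_pos: Matrix, prior_pos: Matrix,
         ffn: Callable[[Matrix], Matrix]) -> Matrix:
    """Query: image + position; key: prior + position; value: prior without position."""
    attended = cross_attention(add(image, image_pos), add(prior, prior_pos), prior)
    return ffn(layer_norm(attended))
```

Note that the value input carries no position information, so the attended output mixes the raw facial a priori features while the positions only influence the attention weights.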

According to claim 4,
The operation of obtaining an updated image feature matrix of the current encoder using a third network based on the fused image feature matrix of the current encoder is
Determining a normalized position of each feature in the fused image feature matrix of the current encoder, wherein the normalized position indicates the normalized position, within the corresponding feature map, of the feature-map feature to which that feature corresponds;
Determining K normalized positions near the normalized position of each feature, according to a preset rule, in each feature map of the multi-level feature map; and
Performing a weighted sum over the L*K features in the fused image feature matrix of the current encoder that correspond to the K normalized positions in each feature map of the multi-level feature map, and obtaining the result as the feature in the updated image feature matrix of the current encoder that corresponds to each feature in the fused image feature matrix of the current encoder, where L is the number of feature maps of the multi-level feature map
Including,
A method of operating an electronic device.
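
The weighted sum over L*K sampled features can be illustrated with the following Python sketch. It makes several assumptions not found in the claim: features are read with nearest-neighbor lookup rather than interpolation, normalized positions lie in [0, 1], and the K offsets and the weights for each level are supplied by the caller instead of being produced by the preset rule. The names `sample_nearest` and `aggregate` are hypothetical.

```python
from typing import List, Sequence, Tuple

Feature = List[float]
FeatureMap = List[List[Feature]]  # H x W grid of feature vectors


def sample_nearest(feature_map: FeatureMap, x_norm: float, y_norm: float) -> Feature:
    """Read the feature nearest to a normalized (x, y) position, clamped to the map."""
    height, width = len(feature_map), len(feature_map[0])
    col = min(width - 1, max(0, round(x_norm * (width - 1))))
    row = min(height - 1, max(0, round(y_norm * (height - 1))))
    return feature_map[row][col]


def aggregate(feature_maps: Sequence[FeatureMap],
              reference: Tuple[float, float],
              offsets: Sequence[Sequence[Tuple[float, float]]],
              weights: Sequence[Sequence[float]]) -> Feature:
    """Weighted sum of K samples from each of the L feature maps.

    reference:     normalized position of the feature being updated
    offsets[l][k]: normalized (dx, dy) of the k-th sample on level l
    weights[l][k]: weight of that sample
    """
    dim = len(feature_maps[0][0][0])
    out = [0.0] * dim
    for level, fmap in enumerate(feature_maps):
        for (dx, dy), w in zip(offsets[level], weights[level]):
            feat = sample_nearest(fmap, reference[0] + dx, reference[1] + dy)
            out = [o + w * f for o, f in zip(out, feat)]
    return out
```

Because the same normalized position is used on every level, one reference point addresses maps of different resolutions without rescaling per level.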

According to claim 4,
The second network of each encoder further includes a self-attention module,
The operation of obtaining the updated facial a priori feature matrix of the current encoder is
Based on the facial a priori feature matrix corresponding to the current encoder, obtaining a self-attention facial a priori feature matrix corresponding to the current encoder using a self-attention module; and
Based on the self-attention facial a priori feature matrix corresponding to the current encoder and the fused image feature matrix of the current encoder, obtaining an updated facial a priori feature matrix of the current encoder using a first deformable attention module.
Including,
A method of operating an electronic device.

According to claim 8,
The self-attention module includes a cascaded self-attention layer, a layer normalization layer, and a feedforward network layer,
The operation of acquiring the self-attention facial a priori feature matrix of the current encoder is
Inputting the facial a priori feature matrix corresponding to the current encoder with embedded position information, the facial a priori feature matrix corresponding to the current encoder with embedded position information, and the facial a priori feature matrix corresponding to the current encoder into the self-attention layer as the query matrix, key matrix, and value matrix, respectively, and obtaining the self-attention facial a priori feature matrix of the current encoder through the cascaded self-attention layer, layer normalization layer, and feedforward network layer
Including,
The operation of acquiring the updated facial a priori feature matrix of the current encoder using the second deformable attention module is
Determining a normalized position, in the last level feature map, of each feature in the self-attention facial a priori feature matrix of the current encoder, wherein the normalized position indicates the normalized position in the last level feature map of the feature-map feature to which that feature corresponds;
Determining K normalized positions near the normalized position in the last level feature map according to a preset rule; and
Determining K features corresponding to the K normalized positions in the updated image feature matrix of the current encoder, performing a weighted sum over the K features, and obtaining the result as the feature of the updated facial a priori feature matrix of the current encoder that corresponds to each feature in the self-attention facial a priori feature matrix
Including,
A method of operating an electronic device.

A computer-readable recording medium storing a computer program that executes the method of any one of claims 1 to 9.

In an electronic device,
A processor; and
A memory that stores instructions
Including,
The instructions, when executed by the processor, causing the electronic device to execute the method of any one of claims 1 to 9,
An electronic device.

In an electronic device,
A first acquisition unit configured to acquire an initial image feature matrix of the face image based on a multi-level feature map of the face image;
A second acquisition unit configured to acquire an initial facial a priori feature matrix of the face image based on the last level feature map of the multi-level feature map; and
A third acquisition unit configured to acquire a super-resolution image of the face image and/or key point coordinates of the face image, based on the initial image feature matrix and the initial facial a priori feature matrix, using one or more cascaded encoders
Including,
An electronic device.