KR20230071720A

KR20230071720A - Method of predicting landmark coordinates of facial image and Apparatus thereof

Info

Publication number: KR20230071720A
Application number: KR1020220130309A
Authority: KR
Inventors: 즈동 구어; 한재준; 이선민; 한승주; 후이 리
Original assignee: 삼성전자주식회사 (Samsung Electronics Co., Ltd.)
Priority date: 2021-11-16
Filing date: 2022-10-12
Publication date: 2023-05-23
Also published as: CN114299563A

Abstract

A method and apparatus for predicting landmark coordinates of a face image are provided. The method includes: acquiring multi-stage feature maps of a face image through a convolutional neural network layer; obtaining an initial query matrix by fully connecting the last-stage feature map of the multi-stage feature maps through a fully connected layer, wherein the initial query matrix represents initial features of the landmarks of the face image and the number of elements in the initial query matrix equals the number of landmarks of the face image; obtaining a memory feature matrix by flattening and concatenating the multi-stage feature maps; and inputting the memory feature matrix and the initial query matrix to at least one cascaded decoder layer to determine the landmark coordinates of the face image.

Description

Method of predicting landmark coordinates of facial image and apparatus thereof

The present disclosure relates to the field of facial landmark detection and, more specifically, to a method and apparatus for predicting landmark coordinates of a face image.

Facial landmark detection, also called face alignment, aims to automatically locate reference landmarks in a face image. It is an important part of face analysis tasks such as face recognition, expression analysis, face frontalization, and 3D face reconstruction.

With the recent rapid development of deep neural networks, facial landmark detection has improved greatly. Existing facial landmark detection methods fall broadly into two groups: coordinate regression-based methods and heatmap regression-based methods.

Coordinate regression-based methods map an input image directly to landmark coordinates. In a model built on a deep learning framework, the input image is fed to a CNN to obtain image features, and a fully connected prediction layer then maps the image features directly to coordinates. To improve detection accuracy, the coordinate regression module is usually cascaded or integrated with a heatmap regression module.

Heatmap regression-based methods first generate heatmaps from the given landmark coordinates; the heatmaps serve as a proxy for the ground-truth (GT) values. Each heatmap represents the probability of a landmark location. Models based on heatmap regression typically predict the heatmaps with a fully convolutional network and then obtain each landmark from the location of the peak probability in its heatmap. Because heatmap-based models preserve the spatial structure of image features, they generally outperform coordinate regression-based models.

Although heatmap regression-based methods are relatively accurate, they have the following limitations. 1) A post-processing step is needed to map the heatmap probability output to coordinates, and because this step is non-differentiable, it prevents end-to-end training of the whole framework. 2) For reasons of computational complexity, the heatmap resolution is generally lower than the input image resolution, so quantization errors are unavoidable and further improvement is limited. 3) Generating hand-designed GT heatmaps introduces heuristic hyperparameters that must be tuned.
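The quantization and post-processing limitations can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than anything taken from the patent: the 256-pixel image, the 64-cell heatmap, the Gaussian width `sigma` (one of the heuristic hyperparameters mentioned above), and the helper names `make_heatmap` and `decode_argmax`.

```python
import math

def make_heatmap(cx, cy, size, sigma=1.0):
    """Gaussian target heatmap centred on (cx, cy), given in heatmap cells."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def decode_argmax(heatmap, stride):
    """Post-processing: take the peak cell and scale it back to image pixels.
    argmax is piecewise constant, so no useful gradient flows through it."""
    _, x, y = max((v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row))
    return (x * stride, y * stride)

image_size, heatmap_size = 256, 64           # hypothetical resolutions
stride = image_size // heatmap_size          # 4 image pixels per heatmap cell
true_xy = (101.0, 59.0)                      # hypothetical landmark, image pixels
heatmap = make_heatmap(true_xy[0] / stride, true_xy[1] / stride, heatmap_size)
pred_xy = decode_argmax(heatmap, stride)
error = math.dist(true_xy, pred_xy)          # quantization error in pixels
```

Even with a perfect heatmap, the decoded landmark snaps to the nearest heatmap cell, so the recovered position differs from the true one by up to a few image pixels.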

Coordinate regression-based methods avoid these limitations and allow end-to-end model training. However, using a fully connected prediction layer can destroy the spatial structure of image features. The mapping from global features to landmark coordinates acts as a black box in which features and predictions are not aligned, which can degrade localization performance.

There is a need for a method and apparatus that can predict landmark coordinates of a face image more accurately. Compared with heatmap-based approaches, such a method and apparatus can be trained end to end without heuristic post-processing, and can extract the image features around each landmark that are most relevant to coordinate prediction.

According to an embodiment of the present disclosure, a method of predicting landmark coordinates of a face image is provided. The method may include: acquiring multi-stage feature maps of the face image through a convolutional neural network layer; obtaining an initial query matrix by fully connecting the last-stage feature map of the multi-stage feature maps through a fully connected layer, wherein the initial query matrix represents initial features of the landmarks of the face image and the number of elements in the initial query matrix equals the number of landmarks of the face image; obtaining a memory feature matrix by flattening and concatenating the multi-stage feature maps; and inputting the memory feature matrix and the initial query matrix to at least one cascaded decoder layer to determine the landmark coordinates of the face image.
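As a rough sketch of how the two decoder inputs could be assembled, the following pure-Python example flattens and concatenates the stage feature maps into a memory matrix and passes the last-stage map through a dense layer to obtain one query row per landmark. The shapes, the shared channel count across stages, the random weights, and the helper names (`flatten_stage`, `fully_connected`, `build_decoder_inputs`) are illustrative assumptions, not details stated in the patent.

```python
import random

def flatten_stage(fmap):
    """[C][H][W] feature map -> H*W row vectors of length C."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[c][y][x] for c in range(C)] for y in range(H) for x in range(W)]

def fully_connected(vec, weight, bias):
    """Dense layer: weight is [out][in], bias is [out]."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def build_decoder_inputs(stages, weight, bias, num_landmarks, dim):
    # Memory feature matrix: every stage flattened, then concatenated row-wise.
    memory = [row for fmap in stages for row in flatten_stage(fmap)]
    # Initial query matrix: last-stage map -> one vector -> FC -> N rows of length dim.
    last = [v for plane in stages[-1] for row in plane for v in row]
    out = fully_connected(last, weight, bias)
    q_init = [out[i * dim:(i + 1) * dim] for i in range(num_landmarks)]
    return q_init, memory

rng = random.Random(0)
C, N = 4, 5                                   # hypothetical channel and landmark counts
sizes = [8, 4, 2]                             # hypothetical spatial size per stage
stages = [[[[rng.random() for _ in range(s)] for _ in range(s)] for _ in range(C)]
          for s in sizes]
in_dim, out_dim = C * sizes[-1] ** 2, N * C
weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
bias = [0.0] * out_dim
q_init, memory = build_decoder_inputs(stages, weight, bias, N, C)
```

Here `q_init` has one row per landmark, matching the requirement that the number of elements in the initial query matrix equals the number of landmarks, and `memory` has one row per spatial position across all stages.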

Optionally, the decoder layer includes a cascaded self-attention module layer, deformable attention module layer, and landmark coordinate prediction layer, and determining the landmark coordinates of the face image may include: inputting the initial query matrix with embedded position information and the initial query matrix to the self-attention module layer of the first decoder layer as the query matrix, key matrix, and value matrix; obtaining the output matrix of the deformable attention module layer of the current decoder layer by inputting, to that deformable attention module layer, the output matrix of the self-attention module layer of the current decoder layer, the memory feature matrix, and the landmark coordinates predicted by the previous decoder layer, wherein the output matrix of the self-attention module layer and the memory feature matrix are the query matrix and the value matrix of the deformable attention module layer, and the landmark coordinates predicted by the previous decoder layer that are input to the deformable attention module layer of the first decoder layer are initial landmark coordinates obtained from the initial query matrix; inputting the output matrix of the deformable attention module layer of the current decoder layer, and that output matrix with embedded position information, to the self-attention module layer of the next cascaded decoder layer as the value matrix, query matrix, and key matrix of that self-attention module layer; and inputting the output matrix of the deformable attention module layer of the current decoder layer to the landmark coordinate prediction layer of the current decoder layer to obtain the landmark coordinates of the face image predicted by the current decoder layer, wherein the landmark coordinates predicted by the last decoder layer are the final landmark coordinates of the face image.
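The data flow between cascaded decoder layers can be sketched as a short loop. This is a sketch under stated assumptions: position information is embedded by simple addition, the position-embedded matrix serves as both query and key while the plain matrix serves as value (inferred from the description of the next-layer inputs), and the three modules are passed in as hypothetical callables named `self_attn`, `deform_attn`, and `predict`.

```python
def run_decoder(q_init, memory, init_coords, layers, pos_embed):
    """Cascade decoder layers; each layer is a dict of three callables."""
    x, coords, per_layer = q_init, init_coords, []
    for layer in layers:
        # Queries and keys carry position information; values do not (assumption).
        qk = [[a + p for a, p in zip(row, pos)] for row, pos in zip(x, pos_embed)]
        sa_out = layer["self_attn"](qk, qk, x)
        # Deformable attention: self-attention output is the query, memory the value,
        # and the previously predicted coordinates give the reference points.
        x = layer["deform_attn"](sa_out, memory, coords)
        coords = layer["predict"](x, coords)
        per_layer.append(coords)
    return coords, per_layer

# Trivial stand-in modules, only to exercise the data flow.
stub = {
    "self_attn": lambda q, k, v: v,
    "deform_attn": lambda q, mem, ref: q,
    "predict": lambda feats, ref: [(x + 0.1, y) for x, y in ref],
}
final_coords, history = run_decoder(
    q_init=[[0.0, 0.0]], memory=[[0.0, 0.0]], init_coords=[(0.5, 0.5)],
    layers=[stub, stub, stub], pos_embed=[[1.0, 1.0]])
```

Keeping the per-layer predictions in `history` mirrors the fact that every decoder layer produces its own coordinate estimate, with the last one taken as the final result.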

Optionally, the output matrix of the self-attention module layer of the decoder layer is obtained as

$$\hat{q}_i = \sum_{j=1}^{N} \alpha_{ij}\, q_j$$

where $\hat{q}_i$ denotes the i-th row vector of the output matrix; $\alpha_{ij}$ denotes the attention weight obtained by normalizing the dot product of the i-th row vector of the query matrix input to the self-attention module layer and the j-th row vector of the key matrix input to the self-attention module layer; $q_j$ denotes the j-th row vector of the initial query matrix or of the output matrix of the deformable attention layer of the previous decoder layer; and N denotes the number of landmarks of the face image. (The equation image and its symbols are missing from the extracted text; the form and symbol names above are reconstructed from these definitions.)
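A minimal sketch of the self-attention step described above: each output row is a weighted sum of the value rows, with weights obtained by normalizing query-key dot products. Using softmax as the normalization is an assumption (the text only says "normalizing"), and learned projections and multi-head splitting are omitted for brevity.

```python
import math

def softmax(xs):
    m = max(xs)                               # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(Q, K, V):
    """Row i of the output is sum_j alpha[i][j] * V[j]."""
    out, weights = [], []
    for q in Q:
        alpha = softmax([sum(a * b for a, b in zip(q, k)) for k in K])
        weights.append(alpha)
        out.append([sum(a * v[c] for a, v in zip(alpha, V)) for c in range(len(V[0]))])
    return out, weights

# With all-zero queries every dot product is 0, so the weights are uniform.
Q = K = [[0.0, 0.0]] * 3
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out, weights = self_attention(Q, K, V)
```

Because there are N queries and N keys (one per landmark), the weight matrix is N by N, which is what lets the module model dependencies between landmarks.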

Optionally, the output matrix of the deformable attention module layer of the decoder layer is obtained as

$$f_i = \sum_{k=1}^{K} \beta_{ik}\, x_{ik}$$

where $f_i$ denotes the updated feature of the i-th landmark in the output matrix; $\beta_{ik}$ denotes the attention weight obtained by performing a fully connected operation and a softmax operation on the query matrix input to the deformable attention module layer; $x_{ik}$ denotes the feature in the memory feature matrix corresponding to the k-th reference point coordinate, where the position offset between the k-th reference point coordinate and the i-th landmark coordinate among the landmark coordinates predicted by the previous decoder layer is obtained by performing a fully connected operation on the query matrix input to the deformable attention module layer; and K may be a preset value. (The equation image and its symbols are missing from the extracted text; the form and symbol names above are reconstructed from these definitions.)
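The sampling idea can be sketched for a single landmark and a single feature map. This is an illustration under assumptions: bilinear interpolation is used to read features at non-integer positions, coordinates are in feature-map pixels, only one stage is sampled rather than the full multi-stage memory, and the weight matrices `W_off` and `W_att` are hypothetical stand-ins for the fully connected operations.

```python
import math

def linear(vec, weight, bias):
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def bilinear(fmap, x, y):
    """Sample an [H][W][C] map at continuous position (x, y), clamped to the border."""
    H, W = len(fmap), len(fmap[0])
    x, y = min(max(x, 0.0), W - 1.0), min(max(y, 0.0), H - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    fx, fy = x - x0, y - y0
    return [(1 - fx) * (1 - fy) * a + fx * (1 - fy) * b + (1 - fx) * fy * c + fx * fy * d
            for a, b, c, d in zip(fmap[y0][x0], fmap[y0][x1], fmap[y1][x0], fmap[y1][x1])]

def deformable_attention(query, ref_xy, fmap, W_off, b_off, W_att, b_att):
    offsets = linear(query, W_off, b_off)         # 2K values: (dx, dy) per sample
    beta = softmax(linear(query, W_att, b_att))   # K attention weights
    out = [0.0] * len(fmap[0][0])
    for k, w in enumerate(beta):
        sample = bilinear(fmap, ref_xy[0] + offsets[2 * k], ref_xy[1] + offsets[2 * k + 1])
        out = [o + w * s for o, s in zip(out, sample)]
    return out, beta

# Toy map whose feature at each pixel is its own (x, y) position.
fmap = [[[float(x), float(y)] for x in range(4)] for y in range(4)]
K = 2
W_off, b_off = [[0.0, 0.0] for _ in range(2 * K)], [0.5, 0.0, -0.5, 0.0]
W_att, b_att = [[0.0, 0.0] for _ in range(K)], [0.0, 0.0]
feature, beta = deformable_attention([1.0, 0.0], (1.5, 2.0), fmap, W_off, b_off, W_att, b_att)
```

Only K positions near the previous landmark estimate are read, instead of attending over every row of the memory matrix.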

Optionally, the landmark coordinates predicted by the decoder layer are obtained from the following quantities (the expression itself is an equation image that is missing from the extracted text): y denotes the landmark coordinates predicted by the current decoder layer; $y^{R}$ denotes the landmark coordinates predicted by the previous decoder layer, or the initial landmark coordinates; and $y^{O}$ denotes the output of the landmark coordinate prediction layer, which represents the offset of y relative to $y^{R}$. (The symbol names $y^{R}$ and $y^{O}$ are placeholders; only y survives in the extracted text.)
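A minimal sketch of layer-by-layer coordinate refinement, assuming a plain additive update in which each layer's output offset is added to the previous estimate. The patent's exact expression is not reproduced in this text and may differ (for example, by applying a squashing function), so the `refine` helper and all numeric values below are purely illustrative.

```python
def refine(prev_coords, offsets):
    """Additive update (an assumption): new = previous + offset, per landmark."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(prev_coords, offsets)]

coords = [(0.40, 0.50)]                       # hypothetical initial prediction
for layer_offsets in ([(0.05, 0.02)], [(0.02, -0.01)], [(0.01, 0.0)]):
    coords = refine(coords, layer_offsets)    # one decoder layer per iteration
```

Predicting offsets relative to the previous estimate means each layer only has to make a small correction.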

Optionally, the convolutional network layer, the fully connected layer, and the at least one decoder layer are obtained through training with training image samples based on a regression loss function (the equation image is missing from the extracted text), in which $L_{reg}$ denotes the regression loss; $y^{l}$ denotes the landmark coordinates of the training image sample predicted by each decoder layer; $\hat{y}$ denotes the actual landmark coordinates of the training image sample; $L$ is the number of decoder layers; and $l$ may be the index of a decoder layer. (The symbol names are placeholders; the original symbols are also missing from the extracted text.)
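One plausible reading of this loss, sketched below, sums an L1 distance between each decoder layer's prediction and the ground truth over all layers. The L1 choice follows the description's later statement that only an L1-norm loss is used for training; the plain summation without weighting or averaging, and the function name `l1_regression_loss`, are assumptions.

```python
def l1_regression_loss(per_layer_preds, target):
    """Sum over decoder layers of the L1 distance to the ground-truth coordinates."""
    total = 0.0
    for pred in per_layer_preds:
        total += sum(abs(px - tx) + abs(py - ty)
                     for (px, py), (tx, ty) in zip(pred, target))
    return total

target = [(0.5, 0.5), (0.2, 0.8)]             # hypothetical ground truth, 2 landmarks
preds = [
    [(0.6, 0.5), (0.2, 0.7)],                 # layer 1: off by 0.1 on each landmark
    [(0.5, 0.5), (0.2, 0.8)],                 # layer 2: exact
]
loss = l1_regression_loss(preds, target)
```

Supervising every layer, not only the last, gives each stage of the cascade a direct training signal.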

According to another embodiment of the present disclosure, an apparatus for predicting landmark coordinates of a face image is provided. The apparatus may include: an encoder configured to acquire multi-stage feature maps of the face image through a convolutional neural network layer, to obtain an initial query matrix by fully connecting the last-stage feature map of the multi-stage feature maps through a fully connected layer, wherein the initial query matrix represents initial features of the landmarks of the face image and the number of elements in the initial query matrix equals the number of landmarks of the face image, and to obtain a memory feature matrix by flattening and concatenating the multi-stage feature maps; and a decoder including at least one cascaded decoder layer, the at least one decoder layer being configured to determine the landmark coordinates of the face image based on the memory feature matrix and the initial query matrix received from the encoder.

According to another embodiment of the present disclosure, an apparatus for predicting landmark coordinates of a face image is provided. The apparatus includes a processor and a memory storing one or more instructions executed by the processor, and the processor may be configured to execute the instructions to perform: acquiring multi-stage feature maps of a face image through a convolutional neural network layer; obtaining an initial query matrix by fully connecting the last-stage feature map of the multi-stage feature maps through a fully connected layer, wherein the initial query matrix represents initial features of the landmarks of the face image and the number of elements in the initial query matrix equals the number of landmarks of the face image; obtaining a memory feature matrix by flattening and concatenating the multi-stage feature maps; and inputting the memory feature matrix and the initial query matrix to at least one cascaded decoder layer to determine the landmark coordinates of the face image.

Optionally, in the apparatus, the decoder layer includes a cascaded self-attention module layer, deformable attention module layer, and landmark coordinate prediction layer. The self-attention module layer of the first decoder layer is configured to obtain its output matrix based on the initial query matrix with embedded position information and the initial query matrix, which are respectively input as the query matrix, key matrix, and value matrix of that self-attention module layer. The self-attention module layer of each decoder layer other than the first decoder layer is configured to obtain the output matrix of the deformable attention module of the current decoder layer based on the output matrix of the deformable attention module layer of the previous cascaded decoder layer and that output matrix with embedded position information, wherein the output matrix of the deformable attention module layer of the previous decoder layer and that output matrix with embedded position information are the value matrix, query matrix, and key matrix of the self-attention module layer of the current decoder layer. The deformable attention module layer of the decoder layer is configured to obtain the output matrix of the deformable attention module layer of the current decoder layer based on the output matrix of the self-attention module layer of the current decoder layer, the memory feature matrix, and the landmark coordinates predicted by the previous cascaded decoder layer, wherein the output matrix of the self-attention module layer of the current decoder layer and the memory feature matrix are the query matrix and the value matrix of the deformable attention module layer of the current decoder layer, and, for the first decoder layer, the landmark coordinates predicted by the previous decoder layer are initial landmark coordinates obtained from the initial query matrix. The landmark coordinate prediction layer is configured to obtain the landmark coordinates of the face image predicted by the decoder layer based on the output matrix of the deformable attention module layer of the current decoder layer, and the landmark coordinates predicted by the last decoder layer may be the final landmark coordinates of the face image.

Optionally, in the apparatus, the self-attention module layer of the decoder layer obtains the output matrix as

$$\hat{q}_i = \sum_{j=1}^{N} \alpha_{ij}\, q_j$$

where $\hat{q}_i$ denotes the i-th row vector of the output matrix; $\alpha_{ij}$ denotes the attention weight obtained by normalizing the dot product of the i-th row vector of the query matrix input to the self-attention module layer and the j-th row vector of the key matrix input to the self-attention module layer; $q_j$ denotes the j-th row vector of the initial query matrix or of the output matrix of the deformable attention layer of the previous decoder layer; and N denotes the number of landmarks of the face image. (The equation image and its symbols are missing from the extracted text; the form and symbol names above are reconstructed from these definitions.)

Optionally, in the apparatus, the deformable attention module layer of the decoder layer obtains the output matrix as

$$f_i = \sum_{k=1}^{K} \beta_{ik}\, x_{ik}$$

where $f_i$ denotes the updated feature of the i-th landmark in the output matrix; $\beta_{ik}$ denotes the attention weight obtained by performing a fully connected operation and a softmax operation on the query matrix input to the deformable attention module layer; $x_{ik}$ denotes the feature in the memory feature matrix corresponding to the k-th reference point coordinate, where the position offset between the k-th reference point coordinate and the i-th landmark coordinate among the landmark coordinates predicted by the previous decoder layer is obtained by performing a fully connected operation on the query matrix input to the deformable attention module layer; and K may be a preset value. (The equation image and its symbols are missing from the extracted text; the form and symbol names above are reconstructed from these definitions.)

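Optionally, in the apparatus, the landmark coordinate prediction layer of the decoder layer predicts the landmark coordinates from the following quantities (the expression itself is an equation image that is missing from the extracted text): y denotes the landmark coordinates predicted by the current decoder layer; $y^{R}$ denotes the landmark coordinates predicted by the previous decoder layer, or the initial landmark coordinates; and $y^{O}$ denotes the output of the landmark coordinate prediction layer, which represents the offset of y relative to $y^{R}$. (The symbol names $y^{R}$ and $y^{O}$ are placeholders; only y survives in the extracted text.)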

Optionally, the apparatus is trained with training image samples based on a regression loss function (the equation image is missing from the extracted text), in which $L_{reg}$ denotes the regression loss; $y^{l}$ denotes the landmark coordinates of the training image sample predicted by each decoder layer; $\hat{y}$ denotes the actual landmark coordinates of the training image sample; $L$ is the number of decoder layers; and $l$ may be the index of a decoder layer. (The symbol names are placeholders; the original symbols are also missing from the extracted text.)

According to another embodiment of the present disclosure, a computer-readable storage medium may be provided that stores a computer program which, when executed by a processor, implements the method of predicting landmark coordinates of a face image according to any one of claims 1 to 6.

These and other objects and features of the present invention will become more apparent from the following description, taken in conjunction with the accompanying drawings that illustrate examples, in which:
FIG. 1 is a flowchart illustrating a method for predicting facial landmark coordinates according to an embodiment of the present disclosure.
FIG. 2 is a flowchart illustrating a method for predicting facial landmark coordinates according to an embodiment of the present disclosure.
FIG. 3 is a diagram illustrating an example of the structure of a first decoder layer.
FIG. 4 is a block diagram illustrating the structure of an apparatus for predicting landmark coordinates of a face image according to an embodiment of the present disclosure.
FIG. 5 is a block diagram illustrating the structure of an apparatus for predicting landmark coordinates of a face image according to another embodiment of the present disclosure.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of the present disclosure includes modifications, equivalents, and substitutes that fall within the technical idea described in the embodiments.

Terms such as "first" and "second" may be used to describe various components, but these terms should be interpreted only as distinguishing one component from another. For example, a "first component" may be termed a "second component," and similarly, a "second component" may be termed a "first component."

When a component is referred to as being "connected" to another component, it may be directly connected or coupled to the other component, but it should be understood that intervening components may also be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this disclosure, terms such as "comprise" or "have" are intended to specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

It will be apparent to those skilled in the art that the singular forms "a," "an," "the," and "said" used herein may also include the plural forms unless otherwise specified. The terms "comprising" and "containing" used in the embodiments of the present application mean that the corresponding feature may be implemented as the presented feature, information, data, step, operation, element, and/or component, and do not exclude other features, information, data, steps, operations, elements, components, and/or combinations thereof that are supported in the art. When one element is said to be "connected" or "coupled" to another element, the element may be directly connected or coupled to the other element, or the two elements may be connected through an intermediate element. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein indicates at least one of the items defined by the term; for example, "A and/or B" indicates implementation as "A" or implementation as "A and B."

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present disclosure.

According to the method and apparatus for predicting landmark coordinates of a face image of the present disclosure, landmark detection is treated as the prediction of N coordinates, where N is the number of facial landmarks; examples include a set of six facial landmarks at points such as the eye corners and mouth corners, and the common 68-point and 98-point models. A self-attention module is used to learn latent structural dependencies among the landmark coordinates and to output a feature vector representing each landmark. Then, instead of searching over the entire image feature (which slows model convergence), a deformable attention module focuses on a small number of image features sampled around a reference point. Here, information about the landmark coordinates predicted by the previous layer, or the initially predicted landmark coordinates (that is, the reference points), serves as a guide for extracting the most relevant image features around each landmark and refining the coordinates. In addition, rather than using a randomly initialized query matrix, the present disclosure extracts N features from the last layer of the backbone network in a fully connected manner and uses them as the initial query matrix. This design speeds up model convergence and further improves the accuracy of landmark coordinate prediction for face images.

FIG. 1 is a flowchart illustrating a method for predicting facial landmark coordinates according to an embodiment of the present disclosure.

Compared with the Deformable DETR model, the method according to the present disclosure does not use random numbers as the initial query embedding. Instead, it uses a fully connected layer to extract more meaningful features from the image feature map and uses the extracted features as the initial query embedding. The number of queries is set to the number of landmarks to be predicted, and the initial query embedding based on the extracted features provides approximate, image-dependent reference point locations rather than an image-independent standard landmark template.

In addition, compared with the Deformable DETR model, the method according to the present disclosure removes the encoder layers, keeps only the convolutional layers and the fully connected layer, and reduces the number of decoder layers (for example, Deformable DETR includes six decoder layers, whereas the present disclosure may include three or fewer). Furthermore, by using only an L1-norm loss function (the absolute value of the difference between the predicted value and the actual value) for model training, the model size can be reduced and computational cost can be saved.

Referring to FIG. 1, the last-stage feature map is fully connected to obtain an initial query matrix $Q^{init}$, the multi-stage feature maps are flattened and concatenated to obtain a memory feature matrix, and the memory feature matrix and the initial query matrix are input to the decoder layers, which predict the landmark coordinates of the face image. Each step of the method is described below.

FIG. 2 is a flowchart illustrating a method of predicting landmark coordinates of a face image according to an embodiment of the present disclosure.

Referring to FIG. 2, in step 201, multi-stage feature maps of a face image (e.g., the four-stage feature maps of FIG. 1) are acquired through convolutional neural network layers. Those skilled in the art will understand that the extracted multi-stage feature maps are pyramid features: lower-stage feature maps represent local features of the image, and higher-stage feature maps represent global features of the image.

For example, the pyramid features of the face image may be obtained through a backbone network. The present disclosure uses ResNet18 but is not limited to it. Given an input image of size 256×256×3, the feature maps of the four stages have sizes 64×64×64, 32×32×128, 16×16×256, and 8×8×512, respectively.
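The stage sizes listed above are consistent with a backbone whose stages downsample the input by factors of 4, 8, 16, and 32. The following is a minimal sketch under that assumption; the helper name `stage_shapes` and the stride values are illustrative and are not taken from the disclosure.

```python
# Hypothetical helper (not from the disclosure): spatial size of each backbone stage,
# assuming total strides of 4, 8, 16 and 32 as in a standard ResNet18.
def stage_shapes(input_size=256, strides=(4, 8, 16, 32), channels=(64, 128, 256, 512)):
    """Return (height, width, channels) for each stage of a square input image."""
    return [(input_size // s, input_size // s, c) for s, c in zip(strides, channels)]

shapes = stage_shapes()
for stage, shape in enumerate(shapes, start=1):
    print(f"stage {stage}: {shape[0]}x{shape[1]}x{shape[2]}")
```

With a different backbone or input resolution, only the `strides`, `channels`, and `input_size` arguments would need to change.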

In step 202, the last-stage feature map of the multi-stage feature maps is fully connected through a fully connected layer to obtain an initial query matrix $Q^{init}$. Here, the initial query matrix represents initial features of the landmarks of the face image, and the number of elements of the initial query matrix is equal to the number of landmarks of the face image. $Q^{init}$ can be obtained as follows:

$$Q^{init} = FC(F)$$

Here, F denotes the feature map from the last layer of the backbone, of size (H×W)×C, where H and W are the spatial height and width of the feature map and C is its feature dimension. FC denotes a fully connected layer that maps the spatial extent (H×W) of each feature channel to a vector of size N.

The query matrix $Q^{init}$ has size N×C, where N is the number of landmarks of the face image and C is the feature dimension. It is in essence a learnable matrix that is used to extract landmark-related features and convert them into coordinates.
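The mapping from an (H×W)×C feature map to an N×C query matrix can be sketched in plain Python as follows. This is an illustration only: the function name `init_query`, the bias term, the random weights, and the reduced channel count are assumptions made for the example, not details stated in the disclosure.

```python
import random

def init_query(F, W, b):
    """Map F, an (H*W) x C feature map, to an N x C query matrix.

    W is an N x (H*W) weight matrix and b holds N biases (the bias is an assumption).
    Each channel's H*W spatial values are mapped to N values, one per landmark.
    """
    num_pos, num_ch = len(F), len(F[0])
    return [[sum(W[n][p] * F[p][c] for p in range(num_pos)) + b[n]
             for c in range(num_ch)]
            for n in range(len(W))]

random.seed(0)
H, W_sp, C, N = 8, 8, 16, 68  # toy channel count; the last stage in the text has C = 512
F = [[random.gauss(0, 1) for _ in range(C)] for _ in range(H * W_sp)]
W_fc = [[random.gauss(0, 0.1) for _ in range(H * W_sp)] for _ in range(N)]
b_fc = [0.0] * N

Q_init = init_query(F, W_fc, b_fc)
print(len(Q_init), len(Q_init[0]))  # N rows, C columns
```

Because the weights act on the spatial axis, every row of the result is tied to one landmark while keeping the C feature channels of the backbone.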

In step 203, the multi-stage feature maps are flattened and concatenated (Flatten & Connect) to obtain a memory feature matrix.

For example, a typical encoder-decoder framework may be used. Specifically, a backbone network is used as the encoder to extract pyramid features from the input image. A 1×1 convolution is then applied to these feature maps to obtain feature maps with the same number of output channels. These feature maps are then flattened and concatenated (Flatten & Connect) into the memory feature matrix M.
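The flatten-and-concatenate step can be illustrated with a small sketch. A 1×1 convolution is modeled here as a per-position linear map over channels; the helper names (`conv1x1`, `flatten`, `build_memory`, `constant_map`), the toy map sizes, and the averaging weights are all hypothetical choices for the example.

```python
def conv1x1(fmap, weight):
    """Apply a 1x1 convolution: fmap is H x W x C_in, weight is C_out x C_in."""
    return [[[sum(w * v for w, v in zip(w_out, pixel)) for w_out in weight]
             for pixel in row]
            for row in fmap]

def flatten(fmap):
    """Reshape H x W x C into (H*W) x C in row-major order."""
    return [pixel for row in fmap for pixel in row]

def build_memory(fmaps, weights):
    """Project every stage to a common channel count, then flatten and concatenate."""
    memory = []
    for fmap, weight in zip(fmaps, weights):
        memory.extend(flatten(conv1x1(fmap, weight)))
    return memory

def constant_map(h, w, c, value):
    return [[[value] * c for _ in range(w)] for _ in range(h)]

# Toy pyramid: three stages with different resolutions and channel counts.
fmaps = [constant_map(4, 4, 2, 1.0), constant_map(2, 2, 4, 1.0), constant_map(1, 1, 8, 1.0)]
d = 3  # common number of output channels (toy value)
weights = [[[1.0 / c_in] * c_in for _ in range(d)] for c_in in (2, 4, 8)]

M = build_memory(fmaps, weights)
print(len(M), len(M[0]))  # 16 + 4 + 1 rows, d columns

# Number of rows M would have for the four stage sizes given in the text.
rows = sum(h * w for h, w in ((64, 64), (32, 32), (16, 16), (8, 8)))
print(rows)
```

Each row of M is the feature of one spatial position at one stage, which is what allows the decoder to index M by coordinates later on.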

In step 204, the initial query matrix and the memory feature matrix are input to at least one cascaded decoder layer to determine the landmark coordinates of the face image.

According to an embodiment of the present disclosure, the decoder layers have the same structure; only their input and output matrices differ.

For example, each decoder layer includes a self-attention module layer, a deformable attention module layer, and a landmark coordinate prediction layer connected in cascade. Those skilled in the art should understand that each decoder layer may further include other layers as needed.

For example, determining the landmark coordinates of the face image includes inputting the initial query matrix with embedded position information, the initial query matrix with embedded position information, and the initial query matrix to the self-attention module layer of the first decoder layer as its query matrix, key matrix, and value matrix, respectively.

The output matrix of the self-attention module layer of the current decoder layer, the memory feature matrix, and the landmark coordinates predicted by the previous decoder layer are input to the deformable attention module layer of the current decoder layer to obtain the output matrix of the deformable attention module layer. Here, the output matrix of the self-attention module layer and the memory feature matrix serve as the query matrix and the value matrix of the deformable attention module layer, respectively. For the first decoder layer, the landmark coordinates "predicted by the previous decoder layer" are the initial landmark coordinates obtained from the initial query matrix.

In an example embodiment, the initial landmark coordinates may be obtained by applying a fully connected operation to the initial query matrix.

The output matrix of the deformable attention module layer of the current decoder layer and that output matrix with embedded position information are input to the self-attention module layer of the next cascaded decoder layer, where the former serves as the value matrix and the latter serves as both the query matrix and the key matrix. In addition, the output matrix of the deformable attention module layer of the current decoder layer is input to the landmark coordinate prediction layer of the current decoder layer to obtain the landmark coordinates of the face image predicted by the current decoder layer. The landmark coordinates predicted by the last decoder layer are the final landmark coordinates of the face image.

FIG. 3 is a diagram illustrating an example of the structure of the first decoder layer; the other decoder layers may have the same structure.

Referring to FIG. 3, each decoder layer may include a self-attention module layer and a deformable attention module layer. The self-attention module layer of each decoder layer may include a self-attention layer and a residual summation and normalization (Add & Norm) layer.

The query matrix, key matrix, and value matrix of the self-attention module layer of the first decoder layer are $Q^{init}$ with embedded position information, $Q^{init}$ with embedded position information, and $Q^{init}$, respectively. For each of the other decoder layers, the query matrix and the key matrix of the self-attention module layer are both the output matrix of the deformable attention module layer of the previous decoder layer with embedded position information, and the value matrix is the output matrix of the deformable attention module layer of the previous decoder layer.

The deformable attention module layer of each decoder layer may include a deformable attention layer, a residual summation and normalization (Add & Norm) layer, and a feedforward neural network (FFN) layer.

The output matrix of the self-attention module layer is used as the query matrix input to the deformable attention module layer (deformable attention layer), and the memory feature matrix is the value matrix input to the deformable attention module layer.

The self-attention module layer takes only the query matrix Q as input (for the first decoder layer, Q is $Q^{init}$). It learns the structural dependencies among landmarks. This information is in fact independent of the image; it captures pose, expression, and the like from the landmark positions, and has proven important for landmark localization. The self-attention module layer takes $Q^P$, $Q^P$, and Q as its query matrix, key matrix, and value matrix, respectively, where $Q^P = Q + P$ and P is a learnable position embedding. For the first decoder layer, Q is the initial query matrix $Q^{init}$ and $Q^P$ is $Q^{init}$ with embedded position information; for every other decoder layer, Q is the output matrix of the deformable attention module layer of the previous decoder layer and $Q^P$ is that output matrix with embedded position information. The output of the self-attention module layer is denoted $Q^E$ and can be obtained as follows:

$$q_i^E = \sum_{j=1}^{N} \alpha_{ij}\, q_j \qquad (1)$$

Here, $q_i^E$ denotes the i-th row vector of the output matrix $Q^E$; $\alpha_{ij}$ denotes the attention weight obtained by normalizing the dot product of the i-th row vector of the query matrix input to the self-attention module layer and the j-th row vector of the key matrix input to the self-attention module layer; $q_j$ denotes the j-th row vector of the initial query matrix or of the output matrix of the deformable attention module layer of the previous decoder layer; and N denotes the number of landmarks. For example, when the input query matrix Q, key matrix K, and value matrix V are all N×C matrices,

$$\alpha = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right)$$

where $\alpha_{ij}$ is the (i, j)-th element of the N×N matrix $\alpha$, and $d_k$ is the dimension of the row vectors of the key matrix.

Here, the softmax operation is a well-known technique, defined as follows:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

The softmax operation normalizes every input value into the range (0, 1) so that the outputs sum to 1. The denominator of the formula is the sum of the exponentials of all inputs, and the numerator is the exponential of the particular input.
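The self-attention step described above, with position-embedded queries and keys and plain values, can be sketched as follows. This is a single-head illustration without the residual summation and normalization layer; the function names and the toy inputs are assumptions for the example.

```python
import math

def softmax(xs):
    """Normalize a list of scores into weights in (0, 1) that sum to 1."""
    m = max(xs)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, P):
    """Single-head self-attention with query = key = Q + P and value = Q.

    Q and P are N x C matrices (lists of rows); the result is the N x C matrix Q^E.
    """
    QP = [[q + p for q, p in zip(q_row, p_row)] for q_row, p_row in zip(Q, P)]
    d_k = len(QP[0])
    out = []
    for q_i in QP:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in QP]
        alpha = softmax(scores)  # weights over the N landmarks
        out.append([sum(alpha[j] * Q[j][c] for j in range(len(Q)))
                    for c in range(len(Q[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy landmark queries, C = 2
P = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]  # toy position embedding
Q_E = self_attention(Q, P)
print(Q_E)
```

Each output row is a convex combination of the value rows, so every landmark feature is updated with information from all other landmarks.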

The deformable attention module layer of each decoder layer obtains updated features of the landmarks, that is, the output matrix of the deformable attention module layer, based on the output matrix of the self-attention module layer, the memory feature matrix, and the landmark coordinate matrix predicted by the previous decoder layer (for the first decoder layer, the initial landmark coordinate matrix). Here, the initial landmark coordinate matrix is obtained by fully connecting the initial query matrix. Those skilled in the art should understand that "landmark coordinates" and "landmark coordinate matrix" are used herein with the same or similar meaning.

As an example, the output matrix of the deformable attention module layer can be obtained as follows:

$$q_i^D = \sum_{k=1}^{K} \beta_{ik}\, x_{ik}$$

Here, $q_i^D$ denotes the updated feature of the i-th landmark; $\beta_{ik}$ denotes the attention weight obtained by applying a fully connected operation and a softmax operation to the query matrix input to the deformable attention module layer; and $x_{ik}$ denotes the feature in the memory feature matrix corresponding to the k-th reference point. The position offset between the coordinates of the k-th reference point and the coordinates of the i-th landmark among the landmark coordinates predicted by the previous decoder layer is obtained by applying a fully connected operation to the query matrix input to the deformable attention module layer. K is a preset value. That is, the coordinates of the k-th reference point are the coordinates of the i-th landmark plus the position offset, and the parameter matrix of the fully connected operation is related to K. Specifically, $\beta_{ik}$ can be obtained as follows:

$$\beta_i = \mathrm{softmax}\left(W_\beta\, q_i^E\right)$$

where $\beta_{ik}$ is the k-th element of $\beta_i$, $q_i^E$ denotes the i-th C-dimensional row vector of the query matrix input to the deformable attention module layer, and $W_\beta$ is a K×C matrix, namely the learnable parameter matrix of the fully connected operation applied to $q_i^E$. As above, the product of $W_\beta$ and $q_i^E$ is computed first, and the result is then normalized by the softmax operation to obtain the attention weights $\beta_i$.

$x_{ik}$ denotes the feature obtained by adding the position offset $\Delta p_{ik}$ to $y_i^R$, an element of the initial landmark coordinate matrix (for the first decoder layer) or of the landmark coordinate matrix predicted by the previous decoder layer (i.e., the coordinates of the i-th landmark predicted by the previous decoder layer), and using the result as a coordinate index into the value matrix M (e.g., the feature of M located at the resulting coordinates is selected). Since k takes the values 1, …, K, K features can be indexed from M for each landmark.

Here, the position offset $\Delta p_{ik}$ denotes the relative offset between the position of the k-th of the K reference points obtained from the input query matrix through a fully connected operation (its coordinates are the coordinates of the i-th landmark plus the position offset $\Delta p_{ik}$) and the position of the i-th landmark. The position offset can be obtained by the following formula:

$$\Delta p_i = W_p\, q_i^E$$

where $\Delta p_{ik}$ denotes the k-th element of $\Delta p_i$, with k = 1, …, K; $q_i^E$ denotes the i-th C-dimensional row vector of the input query matrix; and $W_p$ denotes a 2K×C matrix, namely the learnable parameter matrix of the fully connected operation applied to $q_i^E$. The factor 2 reflects that each position consists of two values, a horizontal coordinate and a vertical coordinate.

K denotes the number of reference points used for each landmark, and its value can be set in advance.

To make the embodiments of the present disclosure easier to understand, the following example considers the third element of the coordinates predicted by the previous decoder layer, with K set to 4.

The third element is the coordinates $y_3^R$ of the third landmark predicted by the previous decoder layer (when the current decoder layer is the first decoder layer, $y_3^R$ is the initial coordinates of the third landmark, obtained by fully connecting the initial query matrix $Q^{init}$). Based on the preset K, a fully connected operation is applied to $q_3^E$ to obtain $\Delta p_3$, which contains four elements: $\Delta p_{31}$, $\Delta p_{32}$, $\Delta p_{33}$, and $\Delta p_{34}$. Computing $y_3^R + \Delta p_{31}$ gives the coordinates of the first reference point, corresponding to $\Delta p_{31}$, and these coordinates are used to determine the element $x_{31}$ of the memory feature matrix; that is, $x_{31}$ is the element of the memory feature matrix located at coordinates $y_3^R + \Delta p_{31}$. Likewise, $y_3^R + \Delta p_{32}$, $y_3^R + \Delta p_{33}$, and $y_3^R + \Delta p_{34}$ give the second, third, and fourth reference points, corresponding to $\Delta p_{32}$, $\Delta p_{33}$, and $\Delta p_{34}$, from which the elements $x_{32}$, $x_{33}$, and $x_{34}$ of the memory feature matrix are determined. Finally, weighting $x_{31}$, …, $x_{34}$ by the corresponding attention weights yields $q_3^D$, the updated feature of the third landmark, and the updated features of the other landmarks are obtained in the same way. In other words, the output matrix of the deformable attention module layer represents the updated features of the landmarks of the face image.

As described above, the deformable attention module layer takes $Q^E$ as the query matrix and the memory feature matrix M as the value matrix, and attends only to a small set of features sampled from M around the reference points (i.e., the initial landmark coordinate matrix or the landmark coordinate matrix predicted by the previous decoder layer). For example, when computing the feature of the i-th landmark, the features of K points near the i-th landmark are used.
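The deformable attention update for a single landmark can be sketched as follows. Several simplifications are assumed here and are not stated in the disclosure: a single feature map stands in for the memory feature matrix M, coordinates are normalized to [0, 1], features are looked up with nearest-neighbour indexing rather than interpolation, and the names `W_beta`, `W_p`, and `sample_nearest` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sample_nearest(fmap, xy):
    """Look up the feature of an H x W x C map at normalized (x, y) in [0, 1]."""
    height, width = len(fmap), len(fmap[0])
    col = min(width - 1, max(0, int(round(xy[0] * (width - 1)))))
    row = min(height - 1, max(0, int(round(xy[1] * (height - 1)))))
    return fmap[row][col]

def deformable_attention(q, ref, fmap, W_beta, W_p):
    """Update one landmark feature from K features sampled around its reference point.

    q: C-dim query row, ref: (x, y) reference coordinates,
    W_beta: K x C weight matrix, W_p: 2K x C offset matrix.
    """
    K = len(W_beta)
    beta = softmax(matvec(W_beta, q))           # K attention weights
    flat = matvec(W_p, q)                       # 2K offset values
    offsets = [(flat[2 * k], flat[2 * k + 1]) for k in range(K)]
    feats = [sample_nearest(fmap, (ref[0] + dx, ref[1] + dy)) for dx, dy in offsets]
    return [sum(beta[k] * feats[k][c] for k in range(K)) for c in range(len(feats[0]))]

# Toy example with K = 4 sampled points and a 4 x 4 single-channel map.
fmap = [[[float(r * 10 + c)] for c in range(4)] for r in range(4)]
q = [0.5, -0.5]
W_beta = [[0.1, 0.2], [0.0, 0.3], [-0.2, 0.1], [0.4, 0.0]]
W_p = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1],
       [0.2, 0.2], [0.0, 0.0], [-0.2, 0.1], [0.1, -0.2]]
print(deformable_attention(q, (0.5, 0.5), fmap, W_beta, W_p))
```

Only K positions of the map are read per landmark, which is the source of the efficiency argument made in the text compared with attending to every position.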

For example, after the output matrix of the deformable attention module layer is obtained, the landmark coordinate prediction layer of each decoder layer yields the offset $y^O$ of the landmark coordinates predicted by the current decoder layer relative to the landmark coordinates predicted by the previous decoder layer. In other words, $Q^D$ is taken as the input of the perceptron layer, and the output is $y^O$.

For example, a three-layer fully connected network with ReLU activation functions may be used: each of the first two layers consists of a linear fully connected operation followed by a ReLU activation function, and the last layer directly outputs the coordinate offset information through a fully connected operation, with no ReLU activation function after it. An example is the multilayer perceptron (MLP) of FIG. 3, which takes $Q^D$ as input and outputs coordinate-related information. The ReLU activation function is defined as follows:

$$\mathrm{ReLU}(x) = \max(0, x)$$
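A three-layer perceptron of the kind described, with ReLU after the first two layers and a plain linear output, can be sketched as follows. The function names and the toy weights are assumptions for the example; real weights would be learned.

```python
def relu(v):
    """Element-wise ReLU: negative values become zero."""
    return [max(0.0, x) for x in v]

def linear(W, b, v):
    """Fully connected layer: W is out_dim x in_dim, b has out_dim entries."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def mlp(q, layers):
    """Apply linear + ReLU for every layer except the last, which is linear only."""
    h = q
    for W, b in layers[:-1]:
        h = relu(linear(W, b, h))
    W, b = layers[-1]
    return linear(W, b, h)

identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [
    (identity, [0.0, 0.0]),                  # layer 1: linear + ReLU
    (identity, [0.0, 0.0]),                  # layer 2: linear + ReLU
    ([[-1.0, 0.0], [0.0, 1.0]], [0.0, 0.5]), # layer 3: linear only, outputs a 2-D offset
]
print(mlp([2.0, -3.0], layers))
```

Leaving the last layer without ReLU lets the predicted offset be negative as well as positive, so a landmark can move in either direction.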

After $y^O$ is obtained, the landmark coordinates predicted by the current decoder layer can be obtained as follows:

$$y = \sigma\left(y^O + \sigma^{-1}\left(y^R\right)\right)$$

Here, y denotes the landmark coordinates predicted by the current decoder layer; $y^R$ denotes the initial landmark coordinate matrix (for the first decoder layer) or the landmark coordinates predicted by the previous decoder layer; and $y^O$ denotes the output of the perceptron layer of the landmark coordinate prediction layer, which represents the offset of y relative to $y^R$. The input of the perceptron layer is the output matrix of the deformable attention module layer.

Here, the σ function is known in the art; specifically, it is given by:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
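One way to combine a predicted offset with reference coordinates through the σ function is to add the offset in logit space and map the result back with σ, as in the iterative refinement used by Deformable DETR. The sketch below assumes that form of update, coordinates normalized to (0, 1), and a small clamping constant `eps`; none of these details should be read as confirmed by the disclosure.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inverse_sigmoid(p, eps=1e-6):
    """Logit of p, with p clamped to [eps, 1 - eps] to avoid division by zero (assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def refine(y_ref, y_off):
    """Refine N x 2 reference coordinates with N x 2 offsets: sigmoid(offset + logit(ref))."""
    return [[sigmoid(o + inverse_sigmoid(r)) for r, o in zip(ref_pt, off_pt)]
            for ref_pt, off_pt in zip(y_ref, y_off)]

y_ref = [[0.5, 0.5], [0.2, 0.8]]   # toy reference coordinates
y_off = [[0.0, 1.0], [-0.5, 0.0]]  # toy offsets from the coordinate prediction layer
print(refine(y_ref, y_off))
```

A zero offset leaves a coordinate unchanged, and the result always stays inside (0, 1) regardless of how large the offset is.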

Finally, the landmark coordinates predicted by the last decoder layer are determined as the final predicted landmark coordinates.

For a clearer understanding of the present disclosure, a model with three decoder layers is described below as an example.

For the first decoder layer, the query matrix, key matrix, and value matrix of the self-attention module layer are $Q^{init}$ with embedded position information, $Q^{init}$ with embedded position information, and $Q^{init}$, respectively.

For the second decoder layer, the query matrix, key matrix, and value matrix of the self-attention module layer are, respectively, the output matrix $Q^D$ of the deformable attention module layer of the first decoder layer with embedded position information, the same $Q^D$ with embedded position information, and $Q^D$ of the first decoder layer.

For the third decoder layer, the query matrix, key matrix, and value matrix of the self-attention module layer are, respectively, the output matrix $Q^D$ of the deformable attention module layer of the second decoder layer with embedded position information, the same $Q^D$ with embedded position information, and $Q^D$ of the second decoder layer.

In predicting landmark coordinates, the first decoder layer uses the initial landmark coordinate matrix, the second decoder layer uses the landmark coordinates predicted by the first decoder layer, and the third decoder layer uses the landmark coordinates predicted by the second decoder layer. Those skilled in the art should understand that the model with three decoder layers is only an example; the model may have one or more decoder layers, and the decoder layers other than the first have similar input and output behavior.
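The data flow through the cascade, in which each layer consumes the features and coordinates produced by the layer before it, can be sketched as a simple loop. The decoder layers are replaced here by a hypothetical stand-in (`make_stub_layer`) that only shifts the coordinates; it exists to show the wiring, not the computation inside a layer.

```python
def run_cascade(q_init, memory, y_init, decoder_layers):
    """Run decoder layers in sequence.

    Each layer maps (features, memory, reference coordinates) to
    (updated features, predicted coordinates). Returns the final coordinates
    and the list of per-layer predictions.
    """
    q, y = q_init, y_init
    per_layer = []
    for layer in decoder_layers:
        q, y = layer(q, memory, y)
        per_layer.append(y)
    return y, per_layer

def make_stub_layer(step):
    """Stand-in layer: keeps the features and shifts every coordinate by a fixed step."""
    def layer(q, memory, y_ref):
        return q, [[v + step for v in point] for point in y_ref]
    return layer

layers = [make_stub_layer(0.125) for _ in range(3)]
final, history = run_cascade([[0.0]], [[0.0]], [[0.25, 0.5]], layers)
print(final)
print(len(history))
```

Keeping the per-layer predictions is convenient because a training loss can then be applied to the output of every layer, not only the last one.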

For example, the model may be trained using an L1-norm loss function (the absolute value of the difference between the predicted value and the actual value) between the predicted landmark coordinates of a training image sample and the actual landmark coordinates of the training image sample. The regression loss function $L_{reg}$ used for training is as follows:

$$L_{reg} = \sum_{l=1}^{L} \left\lVert y^{l} - \hat{y} \right\rVert_1$$

Here, $y^{l}$ denotes the landmark coordinates of the training image sample predicted by the l-th decoder layer, $\hat{y}$ denotes the actual coordinates of the landmarks of the training image sample, L is the number of decoder layers, and l is the index of the decoder layer.
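An L1 regression loss accumulated over the predictions of all decoder layers can be sketched as follows. The function name and the unweighted sum over layers and landmarks are assumptions for the example; the disclosure does not state whether any averaging or weighting is applied.

```python
def l1_regression_loss(layer_preds, target):
    """Sum of absolute coordinate errors over all decoder layers and landmarks.

    layer_preds: list (one entry per decoder layer) of N x 2 coordinate lists.
    target: N x 2 list of ground-truth coordinates.
    """
    return sum(abs(p - t)
               for pred in layer_preds
               for pred_pt, true_pt in zip(pred, target)
               for p, t in zip(pred_pt, true_pt))

target = [[0.25, 0.5]]            # one toy landmark
layer_preds = [[[0.5, 0.5]],      # prediction of decoder layer 1
               [[0.25, 0.75]]]    # prediction of decoder layer 2
print(l1_regression_loss(layer_preds, target))
```

Supervising every layer in this way encourages each stage of the cascade to produce usable coordinates, since later layers take those coordinates as reference points.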

According to the facial landmark coordinate prediction method of an embodiment of the present disclosure, the entire prediction model can be trained end to end.

The method of predicting facial landmarks has been described above with reference to FIGS. 1 to 3. An apparatus for predicting facial landmarks according to an embodiment of the present invention is described below with reference to FIGS. 4 and 5.

Referring to FIG. 4, a facial landmark prediction apparatus 400 according to another embodiment of the present disclosure may include an encoder 401 and a decoder 402. Those skilled in the art should understand that the apparatus 400 may further include other components, and that the components included in the apparatus 400 may be divided or combined.

For example, the encoder 401 may be configured to: acquire multi-stage feature maps of a face image through convolutional neural network layers; fully connect the last-stage feature map of the multi-stage feature maps through a fully connected layer to obtain an initial query matrix, where the initial query matrix represents initial features of the landmarks of the face image and the number of elements of the initial query matrix is equal to the number of landmarks of the face image; and flatten and concatenate the multi-stage feature maps to obtain a memory feature matrix.

For example, the encoder may be implemented as a convolution module, a fully connected module, and a flatten-and-concatenate module, where the convolution module is configured to acquire the multi-stage feature maps of the face image, the fully connected module is configured to fully connect the last-stage feature map of the multi-stage feature maps to obtain the initial query matrix, and the flatten-and-concatenate module is configured to flatten and concatenate the multi-stage feature maps to obtain the memory feature matrix.

For example, the decoder 402 may consist of at least one cascaded decoder layer, where the at least one decoder layer may be configured to determine the landmark coordinates of the face image based on the memory feature matrix and the initial query matrix received from the encoder.

For example, each of the at least one decoder layer includes a self-attention module layer, a deformable attention module layer, and a landmark coordinate prediction layer connected in cascade.

For example, the self-attention module layer of the first decoder layer is configured to obtain the output matrix of the self-attention module layer of the first decoder layer based on the received initial query matrix with embedded position information, initial query matrix with embedded position information, and initial query matrix, which are input as the query matrix, key matrix, and value matrix, respectively, of the self-attention module layer of the first decoder layer.

The self-attention module layer of each decoder layer other than the first decoder layer is configured to obtain the output matrix of the deformable attention module of the current decoder layer based on the output matrix of the deformable attention module layer of the preceding cascaded decoder layer and that same output matrix with position information embedded. Here, the output matrix of the deformable attention module layer of the previous decoder layer serves as the value matrix, and the position-embedded output matrix serves as the query matrix and the key matrix, of the self-attention module layer of the current decoder layer.
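The query, key, and value assignment described for the self-attention module layer can be sketched as follows. The sketch makes several assumptions that the text does not state: position information is embedded by element-wise addition, attention is single-head scaled dot-product attention, and no learned projections are applied. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, pos):
    """x: (N, d) landmark features; pos: (N, d) positional embedding."""
    q = x + pos  # query matrix: features with position information embedded
    k = x + pos  # key matrix: features with position information embedded
    v = x        # value matrix: features without position information
    weights = softmax(q @ k.T / np.sqrt(x.shape[1]))  # (N, N), each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
features = rng.standard_normal((5, 8))  # 5 landmarks, 8-dimensional features
pos = rng.standard_normal((5, 8))
out, weights = self_attention(features, pos)
print(out.shape)
```

For the first decoder layer `x` would be the initial query matrix; for later layers it would be the output matrix of the previous layer's deformable attention module layer.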

The deformable attention module layer of each decoder layer is configured to obtain its output matrix based on the output matrix of the self-attention module layer of the current decoder layer, the memory feature matrix, and the landmark coordinates predicted by the preceding cascaded decoder layer. Here, the output matrix of the self-attention module layer of the current decoder layer and the memory feature matrix are the query matrix and the value matrix, respectively, of the deformable attention module layer of the current decoder layer. For the first decoder layer, the landmark coordinates predicted by the previous decoder layer are the initial landmark coordinates obtained based on the initial query matrix.
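A simplified sketch of one deformable attention step is given below. It is illustrative only and rests on several assumptions not taken from the text: the memory is a single un-flattened feature map rather than the concatenated multi-stage matrix, sampling uses the nearest grid cell instead of interpolation, coordinates are normalized to [0, 1], and the per-point attention weights come from a linear map followed by softmax. The names `w_off`, `w_attn`, and `num_points` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def deformable_attention(query, memory_map, ref_xy, w_off, w_attn, num_points):
    """query: (N, d); memory_map: (H, W, d); ref_xy: (N, 2) as (x, y) in [0, 1]."""
    n, _ = query.shape
    h, w, _ = memory_map.shape
    # Fully connected operation on the query gives K position offsets per landmark.
    offsets = (query @ w_off).reshape(n, num_points, 2)      # (N, K, 2)
    attn = softmax(query @ w_attn, axis=-1)                  # (N, K)
    # Sampling points = previously predicted landmark coordinates + offsets.
    pts = np.clip(ref_xy[:, None, :] + offsets, 0.0, 1.0)    # (N, K, 2)
    cols = np.rint(pts[..., 0] * (w - 1)).astype(int)
    rows = np.rint(pts[..., 1] * (h - 1)).astype(int)
    sampled = memory_map[rows, cols]                         # (N, K, d) values
    # Weighted sum over the K sampled features updates each landmark feature.
    return (attn[..., None] * sampled).sum(axis=1)           # (N, d)

rng = np.random.default_rng(0)
n, d, k = 5, 8, 4
query = rng.standard_normal((n, d))
memory_map = rng.standard_normal((6, 6, d))
ref_xy = rng.uniform(0.2, 0.8, size=(n, 2))
w_off = rng.standard_normal((d, k * 2)) * 0.05
w_attn = rng.standard_normal((d, k))
out = deformable_attention(query, memory_map, ref_xy, w_off, w_attn, k)
print(out.shape)
```

Because each landmark attends to only K sampled positions near its previously predicted coordinates, the cost per landmark does not grow with the size of the memory.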

Each landmark coordinate prediction layer is configured to obtain the landmark coordinates of the face image predicted by its decoder layer based on the output matrix of the deformable attention module layer of the current decoder layer. The landmark coordinates predicted by the last decoder layer are the final landmark coordinates of the face image.

For example, the apparatus 400 may further include a fully connected layer that performs a fully connected operation on the initial query matrix to obtain the initial coordinates of the landmarks of the face image.
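The cascade as a whole (initial coordinates from a fully connected layer, then one refined prediction per decoder layer, with the last prediction taken as the final result) can be sketched as a simple loop. This is a control-flow illustration under stated assumptions: the attention modules are replaced by a stand-in linear map with `tanh`, coordinates are normalized to [0, 1] through a sigmoid, and each prediction layer's output is added to the previous coordinates as an offset. The exact update rule of the disclosure is not reproduced here, and all names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_cascade(initial_query, w_init, layers):
    """layers: list of (w_feat, w_pred) weight pairs, one pair per decoder layer."""
    # Fully connected operation on the initial query matrix -> initial coordinates (N, 2).
    coords = sigmoid(initial_query @ w_init)
    feats = initial_query
    per_layer = []
    for w_feat, w_pred in layers:
        # Stand-in for the self-attention and deformable attention module layers.
        feats = np.tanh(feats @ w_feat)
        # Assumed update: the prediction layer outputs an offset to the previous coordinates.
        coords = np.clip(coords + feats @ w_pred, 0.0, 1.0)
        per_layer.append(coords)
    # The prediction of the last decoder layer is the final result.
    return coords, per_layer

rng = np.random.default_rng(0)
n, d = 5, 8
initial_query = rng.standard_normal((n, d))
w_init = rng.standard_normal((d, 2))
layers = [
    (rng.standard_normal((d, d)) * 0.3, rng.standard_normal((d, 2)) * 0.01)
    for _ in range(3)
]
final_coords, per_layer = run_cascade(initial_query, w_init, layers)
print(final_coords.shape, len(per_layer))
```

Keeping the per-layer predictions is useful because a training loss can then supervise every decoder layer, not only the last one.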

FIG. 5 is a block diagram showing the structure of an apparatus 500 for predicting landmark coordinates of a face image according to another embodiment of the present disclosure.

In general, the electronic device 500 includes a processor 501 and a memory 502.

The processor 501 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 501 may be implemented in at least one hardware form among a digital signal processor (DSP), a field-programmable gate array (FPGA), and a programmable logic array (PLA). The processor 501 may also include a main processor and a coprocessor. The main processor, also called a central processing unit (CPU), processes data in the wake-up state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 501 may be integrated with a graphics processing unit (GPU) that renders and draws the content to be shown on the display screen. In some embodiments, the processor 501 may further include an AI processor that handles computing operations related to machine learning.

The memory 502 may include one or more computer-readable storage media, which may be non-transitory. The memory 502 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices and flash memory devices. In some embodiments, the non-transitory computer-readable storage medium of the memory 502 stores at least one instruction that is executed by the processor 501 to implement the method of predicting landmark coordinates of a face image of the present disclosure.

In some embodiments, the electronic device 500 may further include a peripheral device interface 503 and at least one peripheral device. The processor 501, the memory 502, and the peripheral device interface 503 may be connected by a bus or a signal line. Each peripheral device may be connected to the peripheral device interface 503 through a bus, a signal line, or a circuit board. Specifically, the peripheral devices include a radio frequency (RF) circuit 504, a touch screen 505, a camera 506, an audio circuit 507, a positioning component 508, and a power supply 509.

The peripheral device interface 503 may be used to connect at least one input/output (I/O) peripheral device to the processor 501 and the memory 502. In some embodiments, the processor 501, the memory 502, and the peripheral device interface 503 are integrated on the same chip or circuit board. In some other embodiments, any one or more of the processor 501, the memory 502, and the peripheral device interface 503 may be implemented on a separate chip or circuit board; this embodiment places no limitation on this.

The RF circuit 504 is used to receive and transmit radio frequency (RF) signals, also called electromagnetic signals. The RF circuit 504 communicates with communication networks and other communication devices through electromagnetic signals. For transmission, the RF circuit 504 converts electrical signals into electromagnetic signals; on reception, it converts the received electromagnetic signals into electrical signals. Optionally, the RF circuit 504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The RF circuit 504 can communicate with other terminals through at least one wireless communication protocol. Such wireless communication protocols include, but are not limited to, metropolitan area networks, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi networks. In some embodiments, the RF circuit 504 may also include circuitry related to near field communication (NFC); the present disclosure places no limitation on this.

The display screen 505 is used to display a user interface (UI). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch screen, it is also able to collect touch signals on or above its surface. A touch signal may be input to the processor 501 as a control signal for processing. In this case, the display screen 505 may also be used to provide virtual buttons and/or a virtual keyboard, also known as soft buttons and/or a soft keyboard. In some embodiments, there may be a single display screen 505 disposed on the front panel of the electronic device 500. In other embodiments, there may be at least two display screens 505, each disposed on a different surface of the electronic device 500 or arranged in a folding design. In still other embodiments, the display screen 505 may be a flexible display screen disposed on a curved or folding surface of the electronic device 500. The display screen 505 may even be given a non-rectangular irregular shape, that is, a specially shaped screen. The display screen 505 may be manufactured using a technology such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED).

The camera component 506 is used to capture images or video. Optionally, the camera component 506 includes a front camera and a rear camera. In general, the front camera is placed on the front panel of the terminal and the rear camera is placed on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera. The main camera and the depth-of-field camera are combined to implement a background blur function, and the main camera and the wide-angle camera are combined to implement panoramic shooting, virtual reality (VR) shooting, or other combined shooting functions. In some embodiments, the camera component 506 may also include a flash. The flash may be a single color temperature flash or a dual color temperature flash. A dual color temperature flash is a combination of a warm light flash and a cool light flash and can be used for light compensation at different color temperatures.

The audio circuit 507 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals, which are input to the processor 501 for processing or to the RF circuit 504 for voice communication. For stereo capture or noise reduction, there may be a plurality of microphones, each disposed at a different part of the electronic device 500. The microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 501 or the RF circuit 504 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans, for distance measurement and other purposes. In some embodiments, the audio circuit 507 may further include an earphone jack.

The positioning component 508 is used to determine the current geographic location of the electronic device 500 in order to implement navigation or location-based services (LBS). The positioning component 508 may be based on the GPS of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 509 is used to supply power to each component of the electronic device 500. The power supply 509 may be an alternating current source, a direct current source, a disposable battery, or a rechargeable battery. When the power supply 509 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also support fast charging technology.

In some embodiments, the electronic device 500 also includes one or more sensors 510. The one or more sensors 510 include, but are not limited to, an acceleration sensor 511, a gyro sensor 512, a pressure sensor 513, a fingerprint sensor 514, an optical sensor 515, and a proximity sensor 516.

The acceleration sensor 511 can detect the magnitude of acceleration along the three coordinate axes of a coordinate system established for the electronic device 500. For example, the acceleration sensor 511 may be used to detect the components of gravitational acceleration along the three coordinate axes. The processor 501 may control the touch screen 505 to display the user interface in landscape or portrait orientation according to the gravitational acceleration signal collected by the acceleration sensor 511. The acceleration sensor 511 may also be used to collect motion data of a game or of the user.

The gyro sensor 512 can detect the body orientation and rotation angle of the electronic device 500, and can cooperate with the acceleration sensor 511 to collect the user's three-dimensional motion on the electronic device 500. Based on the data collected by the gyro sensor 512, the processor 501 may implement the following functions: motion sensing (for example, changing the UI according to a tilt operation by the user), image stabilization during shooting, game control, and inertial navigation.

The pressure sensor 513 may be disposed on the side frame of the electronic device 500 and/or under the touch screen 505. When the pressure sensor 513 is disposed on the side frame, it can detect the user's grip signal on the electronic device 500, and the processor 501 can perform left-hand or right-hand recognition or quick operations according to the grip signal collected by the pressure sensor 513. When the pressure sensor 513 is disposed under the touch screen 505, the processor 501 controls the operable controls on the UI according to the user's pressure operation on the touch screen 505. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 514 is used to collect the user's fingerprint. Either the processor 501 identifies the user according to the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 itself identifies the user according to the collected fingerprint. When the user's identity is recognized as trusted, the processor 501 authorizes the user to perform the relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 514 may be disposed on the front, back, or side of the electronic device 500. When the electronic device 500 has a physical key or a manufacturer logo, the fingerprint sensor 514 may be integrated with the physical key or the manufacturer logo.

The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the touch screen 505 according to the ambient light intensity collected by the optical sensor 515. Specifically, when the ambient light intensity is high, the display brightness of the touch screen 505 is increased; when the ambient light intensity is low, the display brightness of the touch screen 505 is decreased. In another embodiment, the processor 501 may dynamically adjust the shooting parameters of the camera component 506 according to the ambient light intensity collected by the optical sensor 515.

The proximity sensor 516, also known as a distance sensor, is generally provided on the front panel of the electronic device 500. The proximity sensor 516 is used to measure the distance between the user and the front of the electronic device 500. In one embodiment, when the proximity sensor 516 detects that the distance between the user and the front of the electronic device 500 is gradually decreasing, the processor 501 controls the touch screen 505 to switch from the screen-on state to the screen-off state; when the proximity sensor 516 detects that the distance is gradually increasing, the processor 501 controls the touch screen 505 to switch from the screen-off state to the screen-on state.

Those of ordinary skill in the art will understand that the structure shown in FIG. 5 does not limit the electronic device 500, which may include more or fewer components than shown, combine some components, or adopt a different arrangement of components.

According to an embodiment of the present disclosure, a computer-readable storage medium may further be provided, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to execute the method of predicting landmark coordinates of a face image according to the present disclosure. Examples of the computer-readable storage medium include read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc memory, hard disk drive (HDD), solid-state drive (SSD), card memory (e.g., multimedia card, secure digital (SD) card, or extreme digital (xD) card), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, hard disk, solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium may run in an environment deployed on a computer device such as a client, host, proxy device, or server. In one example, the computer program and any associated data, data files, and data structures are distributed over networked computer systems so that they are stored, accessed, and executed in a distributed manner by one or more processors or computers.

According to an embodiment of the present disclosure, a computer program product may further be provided, and instructions in the computer program product may be executed by a processor of a computer device to implement the method of predicting landmark coordinates of a face image.

Those of ordinary skill in the art will readily conceive of other embodiments of the present disclosure after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as illustrative only, and the true scope and spirit of the present disclosure are defined by the appended claims.

Claims

In a method of predicting landmark coordinates of a face image,
acquiring a multi-stage feature map of the face image through a convolutional neural network layer;
obtaining an initial query matrix by fully connecting a last-stage feature map of the multi-stage feature map through a fully connected layer, wherein the initial query matrix represents initial features of landmarks of the face image, and the number of elements in the initial query matrix is equal to the number of landmarks of the face image;
obtaining a memory feature matrix by flattening and concatenating the multi-stage feature maps; and
determining the landmark coordinates of the face image by inputting the memory feature matrix and the initial query matrix to at least one cascaded decoder layer
including,
A method for predicting coordinates of landmarks in face images.

According to claim 1,
The decoder layer includes a cascaded self-attention module layer, deformable attention module layer, and landmark coordinate prediction layer,
the determining of the landmark coordinates of the face image,
inputting the initial query matrix in which position information is embedded, the initial query matrix in which position information is embedded, and the initial query matrix into the self-attention module layer of a first decoder layer as a query matrix, a key matrix, and a value matrix, respectively;
obtaining an output matrix of the deformable attention module layer, wherein the output matrix of the self-attention module layer and the memory feature matrix are the query matrix and the value matrix, respectively, of the deformable attention module layer, and the landmark coordinates predicted by the previous decoder layer that are input to the deformable attention module layer of the first decoder layer are initial landmark coordinates obtained based on the initial query matrix;
inputting the output matrix of the deformable attention module layer of the current decoder layer, and that output matrix with the position information embedded, into the self-attention module layer of the cascaded next decoder layer as the value matrix, and as the query matrix and the key matrix, respectively, of the self-attention module layer of the next decoder layer; and
obtaining the landmark coordinates of the face image predicted by the current decoder layer by inputting the output matrix of the deformable attention module layer of the current decoder layer to the landmark coordinate prediction layer of the current decoder layer, wherein the landmark coordinates of the face image predicted by the last decoder layer are the final landmark coordinates of the face image;
including,
A method for predicting coordinates of landmarks in face images.

According to claim 2,
The output matrix of the self-attention module layer of the decoder layer,

is obtained with
where

denotes the i-th row vector of the output matrix,

represents the j-th row vector of the initial query matrix or of the output matrix of the deformable attention module layer of the previous decoder layer, and N represents the number of landmarks in the face image,
A method for predicting coordinates of landmarks in face images.

According to claim 3,
The output matrix of the deformable attention module layer of the decoder layer,

is obtained as, where,

denotes the updated feature of the i-th landmark of the output matrix,

denotes a feature corresponding to the k-th reference point coordinate in the memory feature matrix, the position offset between the k-th reference point coordinate and the i-th landmark coordinate among the landmark coordinates predicted by the previous decoder layer is obtained by performing a fully connected operation on the query matrix input to the deformable attention module layer, and K is a preset value,
A method for predicting coordinates of landmarks in face images.

According to claim 2,
The landmark coordinates predicted by the decoder layer,

, where y represents the landmark coordinates predicted by the current decoder layer,

Is

Representing the output of the landmark coordinate prediction layer representing the offset of y for
A method for predicting coordinates of landmarks in face images.

According to claim 2,
The convolutional neural network layer, the fully connected layer, and the at least one decoder layer are obtained through training with training image samples based on the regression loss function below,

where

represents the regression loss,

denotes the actual landmark coordinates of the training image sample,

is the number of decoder layers,

is the index of the decoder layer,
A method for predicting coordinates of landmarks in face images.

An apparatus for predicting landmark coordinates of a face image,
an encoder configured to obtain a multi-stage feature map of the face image through a convolutional neural network layer, to obtain an initial query matrix by fully connecting a last-stage feature map of the multi-stage feature map through a fully connected layer, wherein the initial query matrix represents initial features of landmarks of the face image and the number of elements in the initial query matrix is equal to the number of landmarks of the face image, and to obtain a memory feature matrix by flattening and concatenating the multi-stage feature maps; and
a decoder comprising at least one cascaded decoder layer, wherein the at least one decoder layer is configured to determine the landmark coordinates of the face image based on the memory feature matrix and the initial query matrix received from the encoder;
including,
Device.

An apparatus for predicting landmark coordinates of a face image,
a processor; and
a memory storing one or more instructions executed by the processor,
wherein the processor executes the instructions to:
obtaining a multi-stage feature map of the face image through a convolutional neural network layer;
obtaining an initial query matrix by fully connecting a last-stage feature map of the multi-stage feature map through a fully connected layer, wherein the initial query matrix represents initial features of landmarks of the face image, and the number of elements in the initial query matrix is equal to the number of landmarks of the face image;
obtaining a memory feature matrix by flattening and concatenating the multi-stage feature maps; and
Determining the landmark coordinates of the face image by inputting the memory feature matrix and the initial query matrix to at least one cascaded decoder layer
configured to perform
Device.
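
As a non-limiting illustration of the encoder-side steps recited above, the following Python sketch flattens and concatenates multi-stage feature maps into a memory feature matrix and derives an initial query matrix from the last-stage feature map with a single fully connected operation. The function names, the nested-list layout (channels x height x width), the absence of a bias term, and the assumption that all stages share the same channel count are hypothetical choices made for the example.

```python
def flatten_stage(feature_map):
    """Turn a C x H x W nested list into H*W tokens, each a length-C vector."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def build_memory(stages):
    """Flatten every stage and concatenate the tokens into one memory matrix."""
    memory = []
    for stage in stages:
        memory.extend(flatten_stage(stage))
    return memory


def initial_queries(last_stage, weights, num_landmarks):
    """Fully connect the last-stage map and reshape to num_landmarks rows.

    weights: (num_landmarks * d) x (C * H * W) matrix (assumed, no bias).
    """
    flat = [v for channel in last_stage for row in channel for v in row]
    out = [sum(w * v for w, v in zip(w_row, flat)) for w_row in weights]
    d = len(out) // num_landmarks
    return [out[i * d:(i + 1) * d] for i in range(num_landmarks)]


stage_a = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 channels, 2 x 2
stage_b = [[[9]], [[10]]]                        # 2 channels, 1 x 1 (last stage)
memory = build_memory([stage_a, stage_b])
queries = initial_queries(stage_b, [[1, 0], [0, 1], [1, 1], [0, 0]], num_landmarks=2)
```

In this toy case the memory matrix has 4 + 1 = 5 rows, one per spatial position across both stages, and the query matrix has one row per landmark.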

The apparatus of claim 7 or 8, wherein
the decoder layer includes a self-attention module layer, a deformable attention module layer, and a landmark coordinate prediction layer that are cascaded,
the self-attention module layer of the first decoder layer is configured to obtain an output matrix of the self-attention module layer of the first decoder layer based on the initial query matrix and an initial query matrix in which position information is embedded, wherein the initial query matrix in which the position information is embedded is input as the query matrix and the key matrix of the self-attention module layer of the first decoder layer, and the initial query matrix is input as the value matrix thereof,
the self-attention module layer of each decoder layer other than the first decoder layer is configured to obtain an output matrix of the self-attention module layer of the current decoder layer based on the output matrix of the deformable attention module layer of the cascaded previous decoder layer and the output matrix of the deformable attention module layer of the previous decoder layer in which the position information is embedded, wherein the output matrix of the deformable attention module layer of the previous decoder layer is the value matrix of the self-attention module layer of the current decoder layer, and the output matrix in which the position information is embedded is the query matrix and the key matrix thereof,
the deformable attention module layer of the decoder layer is configured to obtain the output matrix of the deformable attention module layer based on the output matrix of the self-attention module layer of the current decoder layer, the memory feature matrix, and the landmark coordinates predicted in the cascaded previous decoder layer, wherein the output matrix of the self-attention module layer of the current decoder layer and the memory feature matrix are the query matrix and the value matrix, respectively, of the deformable attention module layer of the current decoder layer, and wherein the landmark coordinates predicted in the previous decoder layer of the first decoder layer are initial landmark coordinates obtained based on the initial query matrix,
the landmark coordinate prediction layer is configured to obtain the landmark coordinates of the face image predicted by the current decoder layer based on the output matrix of the deformable attention module layer of the current decoder layer, and
the landmark coordinates of the face image predicted by the last decoder layer are the final landmark coordinates of the face image.

The apparatus of claim 9, wherein
the self-attention module layer of the decoder layer obtains the output matrix according to

where

denotes the i-th row vector of the output matrix,

denotes the j-th row vector of the initial query matrix or of the output matrix of the deformable attention module layer of the previous decoder layer, and N denotes the number of landmarks in the face image.
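
As a non-limiting illustration of the self-attention step, the following Python sketch computes each output row as a weighted sum over the N input row vectors, with the position-embedded rows serving as queries and keys and the plain rows serving as values. The scaled dot-product softmax weighting, the omission of learned projection matrices, and the function names are assumptions made for the example and are not taken from the claimed formula.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(values, position_embedding):
    """values: N x d rows (one per landmark); position_embedding: N x d.

    Queries and keys are values + position_embedding; values are used as is.
    Assumption: no learned projections, single attention head.
    """
    d = len(values[0])
    qk = [
        [v + p for v, p in zip(value_row, pos_row)]
        for value_row, pos_row in zip(values, position_embedding)
    ]
    output = []
    for q_i in qk:
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in qk
        ]
        weights = softmax(scores)
        # Row i of the output is the weighted sum of all N value rows.
        output.append(
            [sum(w * v_j[c] for w, v_j in zip(weights, values)) for c in range(d)]
        )
    return output


out = self_attention([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]])
```

Because the weights for each row sum to one, every output row is a convex combination of the landmark features, which lets each landmark draw on the structure of the others.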

The apparatus of claim 10, wherein
the deformable attention module layer of the decoder layer obtains the output matrix according to

where

denotes the updated feature of the i-th landmark in the output matrix,

denotes the feature corresponding to the k-th reference point coordinate in the memory feature matrix, the position offset between the k-th reference point coordinate and the i-th landmark coordinate among the landmark coordinates predicted in the previous decoder layer is obtained by performing a fully connected operation on the query matrix input to the deformable attention module layer, and K is a preset value.
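
As a non-limiting illustration of the deformable attention step, the following Python sketch updates the feature of one landmark by sampling a feature map at K reference points placed around the landmark coordinate predicted by the previous decoder layer. Several simplifications are assumed for the example: a single-stage feature map stored as height x width x channels, nearest-neighbour sampling instead of interpolation, and offsets and attention weights passed in directly rather than produced by fully connected operations on the query. The function name is hypothetical.

```python
def deformable_attend(feature_map, landmark_xy, offsets, weights):
    """Weighted sum of features sampled at K reference points.

    feature_map: H x W x C nested list.
    landmark_xy: (x, y) in [0, 1], predicted by the previous decoder layer.
    offsets: K pairs (dx, dy) added to the landmark coordinate.
    weights: K attention weights (assumed to sum to 1).
    """
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    updated = [0.0] * channels
    for (dx, dy), weight in zip(offsets, weights):
        # Reference point = landmark + offset, mapped to the nearest cell
        # and clamped to the feature-map bounds.
        col = min(max(round((landmark_xy[0] + dx) * (width - 1)), 0), width - 1)
        row = min(max(round((landmark_xy[1] + dy) * (height - 1)), 0), height - 1)
        for c in range(channels):
            updated[c] += weight * feature_map[row][col][c]
    return updated


grid = [[[1.0], [2.0]], [[3.0], [4.0]]]  # 2 x 2 map, 1 channel
feature = deformable_attend(grid, (0.0, 0.0), [(0.0, 0.0), (1.0, 1.0)], [0.5, 0.5])
```

Sampling only K points per landmark keeps the cost independent of the feature-map size, unlike attention over every position of the memory feature matrix.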

The apparatus of claim 9, wherein
the landmark coordinate prediction layer of the decoder layer predicts the landmark coordinates according to

where y represents the landmark coordinates predicted by the current decoder layer, and

represents the output of the landmark coordinate prediction layer, which indicates the offset of y relative to

.
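
As a non-limiting illustration of offset-based coordinate prediction, the following Python sketch refines reference landmark coordinates by adding a predicted offset in logit space and mapping the result back to the unit interval. The sigmoid / inverse-sigmoid form, the normalisation of coordinates to [0, 1], and the function names are assumptions made for the example; the claimed formula is not recoverable from the surrounding text and may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p, eps=1e-6):
    """Logit of p, with p clamped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def refine_landmarks(reference, offsets):
    """reference: list of (x, y) in [0, 1] from the previous decoder layer.
    offsets: list of (dx, dy) output by the coordinate prediction layer.
    Returns the refined (x, y) list for the current decoder layer.
    """
    return [
        (sigmoid(dx + inverse_sigmoid(rx)), sigmoid(dy + inverse_sigmoid(ry)))
        for (rx, ry), (dx, dy) in zip(reference, offsets)
    ]


unchanged = refine_landmarks([(0.25, 0.5)], [(0.0, 0.0)])
shifted = refine_landmarks([(0.25, 0.5)], [(1.0, -1.0)])
```

With a zero offset the reference coordinates pass through unchanged, so each decoder layer only has to learn a correction to the previous layer's estimate.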

The apparatus of claim 9, wherein
the apparatus is trained with training image samples based on the regression loss function below,

where

represents the regression loss,

denotes the actual landmark coordinates of the training image sample,

is the number of decoder layers, and

is the index of the decoder layer.

A computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of predicting landmark coordinates of a face image according to any one of claims 1 to 6.