WO2023043001A1

WO2023043001A1 - Attention map transferring method and device for enhancement of face recognition performance of low-resolution image

Info

Publication number: WO2023043001A1
Application number: PCT/KR2022/008543
Authority: WO
Inventors: 신성호; 이규빈; 이주순; 이준석; 전창현
Original assignee: 광주과학기술원
Priority date: 2021-09-14
Filing date: 2022-06-16
Publication date: 2023-03-23

Abstract

The present invention relates to an attention map transferring method for enhancement of face recognition performance of a low-resolution image. The attention map transferring method comprises the steps of: training a high-resolution face recognition network for recognizing a face of an arbitrary person on the basis of multiple high-resolution images including the face of the arbitrary person; extracting a first attention map associated with the multiple high-resolution images from the trained high-resolution face recognition network; transferring the extracted first attention map to a low-resolution face recognition network for recognizing a face of an arbitrary person on the basis of multiple low-resolution images including the face of the arbitrary person; and training the low-resolution face recognition network by using the transferred first attention map.

Description

Method and Apparatus for Passing Attention Map for Improving Face Recognition Performance of Low Resolution Image

The present invention relates to a method and apparatus for transmitting an attention map for improving face recognition performance of a low-resolution image, and more particularly, to a method and apparatus for transmitting an attention map using knowledge distillation.

In the field of computer vision, face recognition for identifying people included in an image is an important task. For example, a trained machine learning model may receive images containing people's faces, and detect and identify people's faces within the received images. In general, high-resolution images in which people's faces are clearly displayed are required for such face recognition. In contrast, when a low-resolution image is used, the accuracy of face recognition is significantly reduced.

Meanwhile, research on improving the accuracy of face recognition using low-resolution images has continued. For example, one approach uses a network that converts a low-resolution image into a high-resolution image, such as super resolution (SR), and then performs face recognition on the converted high-resolution image. However, this approach has the problem that an additional network with a larger capacity is required for the resolution conversion.

The present invention provides an attention map transfer method, a computer program stored in a recording medium, and an apparatus (system) to solve the above problems.

The present invention may be implemented in a variety of ways, including a method, apparatus (system) or computer program stored on a readable storage medium.

According to an embodiment of the present invention, an attention map transfer method for improving face recognition performance of a low-resolution image, performed by at least one processor, includes: training a high-resolution face recognition network for recognizing a human face based on a high-resolution image including a human face; extracting a first attention map associated with the high-resolution image from the trained high-resolution face recognition network; transferring the extracted first attention map to a low-resolution face recognition network for recognizing a human face based on a low-resolution image including a human face; and training the low-resolution face recognition network using the transferred first attention map.

According to an embodiment of the present invention, the step of training the low-resolution face recognition network includes extracting a second attention map from the low-resolution face recognition network and training the low-resolution face recognition network, by using knowledge distillation, so that the second attention map becomes similar to the first attention map.

According to an embodiment of the present invention, the step of training the low-resolution face recognition network so that the second attention map becomes similar to the first attention map includes training the low-resolution face recognition network using the sum of the face recognition loss and the distillation loss in the low-resolution face recognition network.

According to an embodiment of the present invention, the high-resolution face recognition network includes a plurality of sequentially connected blocks. The step of training the high-resolution face recognition network includes extracting a first initial attention map from a first block included in the plurality of blocks, extracting a second initial attention map from a second block connected to the first block, and training the high-resolution face recognition network, by using knowledge distillation, so that the second initial attention map becomes similar to the first initial attention map.

According to an embodiment of the present invention, the step of training the high-resolution face recognition network so that the second initial attention map becomes similar to the first initial attention map includes training the high-resolution face recognition network with a loss of the following form:

$$L_{HR} = L_{arcface} + \sum_{i} D\big(MP(A^{S}_{i}),\, A^{S}_{i+1}\big)$$

Here, $L_{HR}$ is the sum of the ArcFace loss $L_{arcface}$ and the distillation loss in the high-resolution face recognition network, $A^{S}_{i}$ represents the spatial attention value of the i-th block of the high-resolution face recognition network, $D$ denotes the distance function for the distillation loss, and $MP$ denotes a max pooling layer.

According to an embodiment of the present invention, the method may further include: obtaining a high-resolution image including a human face; performing down-sampling on the obtained high-resolution image; performing blur processing on the down-sampled image; and generating a low-resolution image by resizing the blurred image to a size corresponding to the high-resolution image.

According to an embodiment of the present invention, the first attention map includes a channel attention map indicating a channel referenced beyond a specific criterion for face recognition and a spatial attention map indicating a feature region referenced beyond another specific criterion for face recognition.

According to an embodiment of the present invention, a high-resolution face recognition network includes a plurality of blocks for extracting features of a high-resolution image and a plurality of attention modules for extracting a first attention map.

A computer program stored in a computer readable recording medium is provided to execute the above-described method according to an embodiment of the present invention on a computer.

In various embodiments of the present invention, the low-resolution face recognition network can be trained to generate a high-level attention map even when using low-resolution images, and thus the accuracy of face recognition using low-resolution images can be effectively improved.

In various embodiments of the present invention, the computing device can effectively improve the performance of a low-resolution face recognition network without additional parameters during training and without slowdown during inference.

In various embodiments of the present invention, even when only a low-resolution image is received because of the low computing power of a driving robot or the like, the low-resolution face recognition network can generate a precise attention map and accordingly recognize the face included in the low-resolution image more accurately.

In various embodiments of the present invention, the attention map extracted from the high-resolution face recognition network and the attention map extracted from the low-resolution face recognition network may have a significantly high correlation, and accordingly, face recognition can be performed with high accuracy even when a low-resolution image is used.

In various embodiments of the present invention, learning of a low-resolution face recognition network can be efficiently performed by passing an attention map rather than a feature vector requiring a large capacity in a learning process.

The effects of the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art (referred to as "ordinary technicians") from the description of the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention will be described with reference to the accompanying drawings described below, in which like reference numbers indicate like elements, but the embodiments are not limited thereto.

FIG. 1 is a diagram illustrating an example of transferring an attention map between networks according to an embodiment of the present invention.

FIG. 2 is a functional block diagram showing the internal configuration of a computing device according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating an example of a high-resolution face recognition network and a low-resolution face recognition network according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating an example of training a high-resolution face recognition network according to an embodiment of the present invention.

FIG. 5 is a diagram illustrating an example of training a low-resolution face recognition network according to an embodiment of the present invention.

FIG. 6 is a flowchart illustrating an example of an attention map transfer method according to an embodiment of the present invention.

FIG. 7 is a block diagram showing an internal configuration of a computing device according to an embodiment of the present invention.

Hereinafter, specific details for the implementation of the present invention will be described in detail with reference to the accompanying drawings. However, in the following description, if there is a risk of unnecessarily obscuring the gist of the present invention, detailed descriptions of well-known functions or configurations will be omitted.

In the accompanying drawings, identical or corresponding elements are given the same reference numerals. In addition, in the description of the following embodiments, overlapping descriptions of the same or corresponding components may be omitted. However, omission of a description of a component does not intend that such a component is not included in an embodiment.

Advantages and features of the disclosed embodiments, and methods of achieving them, will become apparent with reference to the following embodiments in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and can be implemented in various different forms; these embodiments are provided only to make the disclosure complete and to fully convey the scope of the invention to those skilled in the art.

Terms used in this specification will be briefly described, and the disclosed embodiments will be described in detail. The terms used in this specification have been selected from general terms that are currently widely used as much as possible while considering the functions in the present invention, but these may vary depending on the intention or precedent of a person skilled in the related field, the emergence of new technologies, and the like. In addition, in a specific case, there is also a term arbitrarily selected by the applicant, and in this case, the meaning will be described in detail in the description of the invention. Therefore, the term used in the present invention should be defined based on the meaning of the term and the overall content of the present invention, not simply the name of the term.

Expressions in the singular number in this specification include plural expressions unless the context clearly dictates that they are singular. Also, plural expressions include singular expressions unless the context clearly specifies that they are plural. When it is said that a certain part includes a certain component in the entire specification, this means that it may further include other components without excluding other components unless otherwise stated.

In the present invention, the terms "comprise", "comprising" and the like indicate that the stated features, steps, operations, elements and/or components are present, but do not exclude the presence or addition of one or more other features, steps, operations, elements, components, and/or combinations thereof.

In the present invention, when a specific element is referred to as being “coupled”, “combined”, “connected”, or “reacting” with any other element, the specific element may be directly coupled to, combined with, connected to, and/or reacting with the other element, but is not limited thereto. For example, one or more intermediate components may exist between the specific element and the other element. Also, in the present invention, “and/or” may include each of one or more items listed or a combination of at least a part of one or more items.

In the present invention, terms such as "first" and "second" are used to distinguish a specific component from other components, and the aforementioned components are not limited by these terms. For example, the “first” element may have the same or similar shape as the “second” element.

In the present invention, an 'attention map' may refer to a matrix representing specific regions (eg, eyes, nose, ears, mouth, etc.) that affect face recognition among all regions in an image, and/or a visualized image of such a matrix. For example, the attention map may include a plurality of initial attention maps. Also, the attention map may include an attention map extracted from one image or a plurality of attention maps extracted from a plurality of images. Also, in the present invention, the attention value may include a numerical value, a vector, and the like associated with the attention map.

In the present invention, an 'attention module' may refer to a module for extracting an attention map from an image associated with a block. For example, the attention module may include, but is not limited to, a channel attention module (CAM), a spatial attention module (SAM), a convolution block attention module (CBAM), and the like.

In the present invention, 'knowledge distillation' may refer to a technique of improving the performance of a small model by transferring the learned knowledge of a large model to a small model. For example, knowledge distillation may be performed using a loss function or the like.

In the present invention, a 'face recognition network' may refer to a machine learning model, an artificial neural network, and the like for analyzing an image and identifying a person included in the image.

In the present invention, 'loss' and/or 'loss function' may refer to a scale, function, etc. for measuring an error of an object in a machine learning model or the like. A machine learning model or the like may be trained to reduce the error produced by the loss function. For example, the loss function may include face recognition loss, distillation loss, and the like. Here, the face recognition loss function may include a softmax loss function, a distance-based loss function, an angular margin-based loss function (sphereface, cosface, arcface), and the like.
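As an illustration of an angular margin-based face recognition loss, the following is a minimal sketch of the commonly published additive angular margin (ArcFace-style) formulation for a single sample. The function name, the `scale` and `margin` defaults, and the input layout are assumptions made for illustration and are not taken from this specification.

```python
import math


def arcface_loss(cosines, target, scale=64.0, margin=0.5):
    """Additive angular margin softmax loss for one sample (illustrative).

    cosines[j] is assumed to be the cosine between the normalized embedding
    and the normalized weight vector of class j; target is the true class.
    """
    logits = []
    for j, cos_theta in enumerate(cosines):
        if j == target:
            # The margin is added to the angle of the true class only.
            theta = math.acos(max(-1.0, min(1.0, cos_theta)))
            cos_theta = math.cos(theta + margin)
        logits.append(scale * cos_theta)
    # Numerically stable negative log-softmax of the target logit.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]
```

With the margin enabled, the true class must be separated from the others by a larger angle before the loss becomes small, which is the property that makes such losses suitable for face recognition.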

FIG. 1 is a diagram illustrating an example in which an attention map 130 is transferred between networks 110 and 140 according to an embodiment of the present invention. According to an embodiment, the face recognition networks 110 and 140 may refer to networks for identifying a person included in an image by using an image including the person's face, and may be implemented as machine learning models or the like. For example, the face recognition networks 110 and 140 may identify a person included in an image using features such as the position, size, color, shape, and spacing of the person's facial features, but are not limited thereto.

In the illustrated example, there may be a high-resolution face recognition network 110 that identifies a person included in an image using a high-resolution image and a low-resolution face recognition network 140 that identifies a person included in an image using a low-resolution image. In general, identifying a person through a low-resolution image may have lower accuracy than identifying a person through a high-resolution image. For example, in the case of a low-resolution image, it may be difficult to accurately determine the position, size, color, and the like of a person's facial features.

According to an embodiment, the high-resolution face recognition network 110 may be trained to receive a plurality of high-resolution images 120 and output a face recognition result 122. For example, the high-resolution face recognition network 110 may be composed of a machine learning model including a plurality of blocks (eg, a plurality of convolutional blocks) for extracting features of the plurality of high-resolution images 120 and a plurality of attention modules for extracting an attention map 130. Here, the attention map may refer to a matrix representing specific regions (eg, eyes, nose, ears, mouth, etc.) that affect face recognition among all regions in the image, and/or a visualized image of such a matrix. That is, the high-resolution face recognition network 110 may generate the attention map 130 based on the plurality of blocks and the plurality of attention modules, and learn to recognize a human face based on the generated attention map 130.

As described above, when the high-resolution face recognition network 110 is trained, the attention map 130 associated with the plurality of high-resolution images 120 may be extracted from the trained high-resolution face recognition network 110. The attention map 130 extracted in this way may then be transferred to the low-resolution face recognition network 140. Here, the low-resolution face recognition network 140 may be trained to receive a plurality of low-resolution images 150 and output a face recognition result 152, and the transferred attention map 130 may be used in this training process. For example, the low-resolution face recognition network 140 may be trained using the attention map 130 through knowledge distillation.

According to an embodiment, the low-resolution face recognition network 140 may be configured as a machine learning model including a plurality of blocks (eg, a plurality of convolution blocks) and a plurality of attention modules for extracting attention maps suited to the features extracted from each convolution block. That is, like the high-resolution face recognition network 110, the low-resolution face recognition network 140 may generate attention maps based on the plurality of blocks and the plurality of attention modules, and may be trained to recognize a human face based on the generated attention maps. In general, when a low-resolution image is used, the accuracy of the attention map may be reduced compared to when a high-resolution image is used. In this regard, in order to improve the accuracy of the attention map, the low-resolution face recognition network 140 may be trained so that the attention map extracted from it becomes similar to the attention map 130 transferred from the high-resolution face recognition network 110. For example, the extracted attention map may be trained to be similar to the attention map 130 using a specific loss function.

Although FIG. 1 has been described in terms of the low-resolution face recognition network 140 being trained through knowledge distillation, the present invention is not limited thereto, and the high-resolution face recognition network 110 may also be trained through knowledge distillation so that the initial attention map generated in a later block among the plurality of blocks resembles the initial attention map generated in an earlier block. With this configuration, the low-resolution face recognition network 140 can be trained to generate a high-level attention map even when using low-resolution images, and thus the accuracy of face recognition using low-resolution images can be effectively improved.

FIG. 2 is a functional block diagram showing the internal configuration of a computing device 200 according to an embodiment of the present invention. As shown, the computing device 200 may include a low-resolution image generator 210, a high-resolution face recognition network learning unit 220, a low-resolution face recognition network learning unit 230, etc., but is not limited thereto. For example, the computing device 200 may communicate with an external device, a database, and the like, and receive images for training a network.

According to an embodiment, the low-resolution image generating unit 210 may generate a low-resolution image using a high-resolution image. For example, in order to train the attention map generated by the low-resolution face recognition network to be similar to the attention map generated by the high-resolution face recognition network, the images used to extract the corresponding attention maps may be images that depict the same content but have different resolutions. That is, when only a high-resolution image exists, the low-resolution image generation unit 210 may generate a low-resolution image by changing the resolution of the corresponding image.

The low-resolution image generation unit 210 may acquire a high-resolution image including a human face and perform downsampling on the obtained high-resolution image. Here, downsampling reduces the ratio, size, etc. of an image. For example, a high-resolution image may be downsampled by a factor of 2, 4, 8, etc. through interpolation (eg, bicubic interpolation). Also, the low-resolution image generation unit 210 may perform blur processing on the downsampled image. For example, a Gaussian blur technique may be applied to the image, but the blur processing is not limited thereto. Then, the low-resolution image generation unit 210 may generate a low-resolution image by resizing the blurred image to a size corresponding to the high-resolution image. In other words, the low-resolution image generator 210 may generate a low-resolution image by resizing the blurred image back to the original size of the high-resolution image through interpolation (eg, bicubic interpolation).
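The downsample, blur, and resize sequence described above can be sketched as follows. This is a simplified illustration using only the Python standard library: block averaging and pixel repetition stand in for the bicubic interpolation mentioned in the text, the image is assumed to be a single-channel 2-D list whose sides are divisible by `factor`, and all function names and default parameters are hypothetical.

```python
import math


def downsample(img, factor):
    """Shrink a 2-D image by averaging each factor x factor tile."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)] for y in range(h)]


def gaussian_kernel(radius, sigma):
    """Normalized 1-D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [wt / total for wt in weights]


def blur(img, radius=1, sigma=1.0):
    """Separable Gaussian blur; border pixels are replicated."""
    k = gaussian_kernel(radius, sigma)
    h, w = len(img), len(img[0])

    def clamp(v, size):
        return max(0, min(size - 1, v))

    # Horizontal pass, then vertical pass.
    rows = [[sum(k[i + radius] * img[y][clamp(x + i, w)]
                 for i in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]
    return [[sum(k[i + radius] * rows[clamp(y + i, h)][x]
                 for i in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]


def upsample(img, factor):
    """Enlarge a 2-D image by pixel repetition (nearest neighbour)."""
    return [[img[y // factor][x // factor]
             for x in range(len(img[0]) * factor)]
            for y in range(len(img) * factor)]


def make_low_resolution(img, factor=4):
    """Downsample, blur, then resize back to the original size."""
    return upsample(blur(downsample(img, factor)), factor)
```

The result has the same width and height as the input but carries only the detail that survives the reduced resolution, so a high-resolution image and its low-resolution counterpart stay pixel-aligned when their attention maps are compared.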

The high-resolution face recognition network learning unit 220 may train a high-resolution face recognition network for recognizing a human face based on a high-resolution image including a human face. For example, the high-resolution face recognition network may include a plurality of sequentially connected blocks (eg, convolution blocks), and the high-resolution face recognition network learning unit 220 may extract a first initial attention map from a first block included in the plurality of blocks and a second initial attention map from a second block connected to the first block. Then, the high-resolution face recognition network learning unit 220 may train the high-resolution face recognition network so that the second initial attention map becomes similar to the first initial attention map by using knowledge distillation. For example, an attention map generated in an earlier block may include more context information than an attention map generated in a later block. Accordingly, the high-resolution face recognition network learning unit 220 may perform training so that the second initial attention map generated in the later block becomes similar to the first initial attention map generated in the earlier block.

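According to one embodiment, the high-resolution face recognition network may be trained using a loss function. Here, the high-resolution face recognition network learning unit 220 may perform training using Equation 1 below.

$$L_{HR} = L_{arcface} + \sum_{i} D\big(MP(A^{S}_{i}),\, A^{S}_{i+1}\big) \quad \text{(Equation 1)}$$

Here, $L_{HR}$ is the sum of the ArcFace loss $L_{arcface}$ and the distillation loss in the high-resolution face recognition network, $A^{S}_{i}$ represents the spatial attention value of the i-th block, $D$ denotes the distance function for the distillation loss, and $MP$ may represent a max pooling layer. Also, $MP$ may be a max pooling layer with a 2x2 kernel. For example, the size of the attention map of the i-th block constituting the high-resolution face recognition network may be twice the size of that of the (i+1)-th block, and accordingly, the max pooling layer can downsample the attention map to 1/2 size.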
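The distillation term of Equation 1 can be illustrated with a short sketch: each block's spatial attention map is reduced with 2x2 max pooling and then compared with the map of the next block. The function names are hypothetical, the attention maps are assumed to be 2-D lists whose sides halve from block to block, and a plain Euclidean distance is used as a stand-in for the distance function of Equation 2.

```python
def max_pool_2x2(att):
    """2x2 max pooling with stride 2 on a 2-D attention map."""
    return [[max(att[y][x], att[y][x + 1], att[y + 1][x], att[y + 1][x + 1])
             for x in range(0, len(att[0]) - 1, 2)]
            for y in range(0, len(att) - 1, 2)]


def l2_distance(a, b):
    """Stand-in distance: Euclidean distance between two equal-size maps."""
    return sum((p - q) ** 2
               for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b)) ** 0.5


def self_distillation_loss(attention_maps, distance=l2_distance):
    """Sum the distances between each pooled map and the next block's map."""
    return sum(distance(max_pool_2x2(earlier), later)
               for earlier, later in zip(attention_maps, attention_maps[1:]))
```

In a training setup this term would be added to the face recognition loss, so that minimizing the total also pulls the later block's attention toward the pooled attention of the earlier block.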

Also, the distance function $D$ can be calculated by Equation 2 below.

$$D(x, y) = \lambda\, d_{cos}(x, y) + (1 - \lambda)\, \lVert x - y \rVert_{p} \quad \text{(Equation 2)}$$

Here, the distance function $D$ may be a linear combination of the cosine distance $d_{cos}$ and the Lp norm, and the Lp norm may include the L1 distance and the L2 distance. Also, $\lambda$ may be a weighting factor for adjusting the Lp norm and the cosine distance. Since the dimension of the attention map decreases from an initial block to a deeper block, the knowledge distillation process can be stabilized by using both the cosine distance and the Lp norm distance. Additionally or alternatively, although the distance function $D$ is described above with reference to FIG. 2 as being a linear combination of the cosine distance and the Lp norm, it is not limited thereto, and an arbitrary distance function and/or a combination thereof may be used depending on the data set.
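A distance function that linearly combines the cosine distance with an Lp norm can be sketched as below. The exact weighting used in Equation 2 is not recoverable from the text, so the convex combination controlled by `weight` is an assumption, as are the function names and the flattened-vector inputs; the small `eps` term only guards against division by zero.

```python
def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = sum(p * p for p in a) ** 0.5
    norm_b = sum(q * q for q in b) ** 0.5
    return 1.0 - dot / (norm_a * norm_b + eps)


def lp_distance(a, b, p=2):
    """Lp norm of the difference (p=1 gives L1, p=2 gives L2)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def combined_distance(a, b, weight=0.5, p=2):
    """Assumed form: weight * cosine distance + (1 - weight) * Lp distance."""
    return weight * cosine_distance(a, b) + (1.0 - weight) * lp_distance(a, b, p)
```

The cosine part compares only the direction of the two attention vectors, while the Lp part also penalizes differences in magnitude, so `weight` trades one kind of agreement against the other.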

According to an embodiment, the low-resolution face recognition network learning unit 230 may train the low-resolution face recognition network using the first attention map transferred from the high-resolution face recognition network. For example, the low-resolution face recognition network learning unit 230 may extract the second attention map from the low-resolution face recognition network and train the low-resolution face recognition network, by using knowledge distillation, so that the second attention map becomes similar to the first attention map.

According to one embodiment, the low-resolution face recognition network may be trained using a loss function. Here, the low-resolution face recognition network learning unit 230 may perform training using the sum of the face recognition loss and the distillation loss in the low-resolution face recognition network. For example, the distillation loss can be calculated using Equation 3 below.

$$L_{distill} = \sum_{i} w_{i} \Big[ D\big(A^{S}_{HR,i},\, A^{S}_{LR,i}\big) + D\big(A^{C}_{HR,i},\, A^{C}_{LR,i}\big) \Big] \quad \text{(Equation 3)}$$

Here, $L_{distill}$ is the distillation loss in the low-resolution face recognition network, $A^{S}_{HR,i}$ and $A^{S}_{LR,i}$ denote the spatial attention values of the i-th block of the high-resolution face recognition network and the low-resolution face recognition network, $A^{C}_{HR,i}$ and $A^{C}_{LR,i}$ represent the channel attention values of the i-th block of the high-resolution face recognition network and the low-resolution face recognition network, $w_{i}$ represents the weight factor of the i-th block, and $D$ can represent the distance function for the distillation loss. Using such a loss function, the low-resolution face recognition network is trained to focus on a target region among the face regions included in the low-resolution image, and can be trained to have performance similar to that of the high-resolution face recognition network even when only the low-resolution image is used. Additionally or alternatively, although it has been described above with reference to FIG. 2 that the distillation loss is calculated using both the spatial attention value and the channel attention value, the present invention is not limited thereto; the spatial attention value or the channel attention value may be transferred independently, or at least some of the spatial attention value, the channel attention value, and other arbitrary attention values may be transferred together.
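The distillation loss between the two networks can be sketched as a weighted sum, over blocks, of the distances between corresponding spatial and channel attention values. The sketch below assumes that each attention value has already been flattened to a list of floats, uses the cosine distance as one possible choice of distance function, and uses hypothetical names throughout.

```python
def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = sum(p * p for p in a) ** 0.5
    norm_b = sum(q * q for q in b) ** 0.5
    return 1.0 - dot / (norm_a * norm_b + eps)


def distillation_loss(hr_spatial, lr_spatial, hr_channel, lr_channel,
                      block_weights, distance=cosine_distance):
    """Weighted sum over blocks of spatial and channel attention distances.

    Each of the first four arguments holds one flattened attention vector
    per block; block_weights holds one weight factor per block.
    """
    total = 0.0
    for s_hr, s_lr, c_hr, c_lr, w in zip(hr_spatial, lr_spatial,
                                         hr_channel, lr_channel,
                                         block_weights):
        total += w * (distance(s_hr, s_lr) + distance(c_hr, c_lr))
    return total


def total_low_resolution_loss(recognition_loss, distill_loss):
    """Training objective: face recognition loss plus distillation loss."""
    return recognition_loss + distill_loss
```

During training of the low-resolution network only the low-resolution attention values would receive gradients; the high-resolution attention values act as fixed targets.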

Although each functional configuration included in the computing device 200 has been separately described in FIG. 2, this is only to help understanding of the present invention, and one computing device may perform two or more functions. In addition, although the computing device 200 is described in FIG. 2 as training both the high-resolution face recognition network and the low-resolution face recognition network, it is not limited thereto, and a separate device for training each network may exist. With this configuration, the computing device 200 can effectively improve the performance of the low-resolution face recognition network without additional parameters during training and without slowdown during inference. That is, the size of the inference network model does not increase before and after the knowledge transfer, and accordingly, the computing device 200 can perform face recognition with high accuracy in the inference step by using only the low-resolution face recognition network for which the knowledge transfer has been completed.

FIG. 3 is a diagram illustrating examples of a high-resolution face recognition network 310 and a low-resolution face recognition network 330 according to an embodiment of the present invention. As discussed above, the high-resolution face recognition network 310 may be trained to perform face recognition 324 using a high-resolution image 320 including a human face. Here, the high-resolution face recognition network 310 may include a plurality of blocks for extracting features of a high-resolution image and a plurality of attention modules for extracting the attention map 322. That is, the attention map 322 associated with the high-resolution image 320 may be extracted from the trained high-resolution face recognition network.

According to an embodiment, the attention map 322 may be used to extract features of a human face by Equation 4 below.

$$F' = M(F) \otimes F \quad \text{(Equation 4)}$$

Here, F may be a feature map extracted from an image, and M(F) may be an attention map extracted from the corresponding image. Also, F' may be a feature map refined by the attention map to focus on a specific region for face recognition, and $\otimes$ denotes element-wise multiplication.
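Assuming Equation 4 takes the usual attention-refinement form, in which the feature map is multiplied element-wise by the attention map, the refinement can be sketched as follows. The feature map is assumed to be a C x H x W nested list, a channel attention map is assumed to hold one value per channel, and a spatial attention map one value per pixel; the function names are hypothetical.

```python
def apply_channel_attention(feature, channel_att):
    """Scale every value of channel c by channel_att[c]."""
    return [[[channel_att[c] * v for v in row] for row in plane]
            for c, plane in enumerate(feature)]


def apply_spatial_attention(feature, spatial_att):
    """Scale the value at (y, x) of every channel by spatial_att[y][x]."""
    return [[[spatial_att[y][x] * v for x, v in enumerate(row)]
             for y, row in enumerate(plane)]
            for plane in feature]
```

Values near 1 in the attention map leave the corresponding features almost unchanged, while values near 0 suppress them, which is how the refined feature map comes to focus on the regions that matter for recognition.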

The attention map 322 may include a channel attention map (CAM) indicating a channel referenced beyond a specific criterion for face recognition and a spatial attention map (SAM) indicating a feature region referenced beyond another specific criterion for face recognition. According to an embodiment, the channel attention map may be generated by the channel attention module, which uses pooling layers to obtain the activated channels. When the intermediate feature map satisfies $F \in \mathbb{R}^{C \times H \times W}$, the channel attention map can be calculated by Equation 5 below.

$$M_{c}(F) = \sigma\big(W_{1}(W_{0}(F^{c}_{avg})) + W_{1}(W_{0}(F^{c}_{max}))\big), \quad F^{c}_{avg} = AvgPool(F),\; F^{c}_{max} = MaxPool(F) \quad \text{(Equation 5)}$$

Here, $\sigma$ denotes the sigmoid function, and $W_{0}$ and $W_{1}$ can represent fully connected (FC) layers having the weight matrices $W_{0} \in \mathbb{R}^{C/r \times C}$ and $W_{1} \in \mathbb{R}^{C \times C/r}$. In this case, $W_{0}$ and $W_{1}$ may be shared by both pooling layers, and the ReLU activation function may follow $W_{0}$. Also, r may be a ratio for downsampling, and $F^{c}_{avg}$ and $F^{c}_{max}$ may represent the outputs of the average pooling layer and the max pooling layer, respectively. Also, $AvgPool$ and $MaxPool$ may represent pooling layers with a 1x1 kernel.
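The channel attention computation can be sketched as follows, following the commonly used channel attention module design that matches the description above: the feature map is average-pooled and max-pooled over its spatial dimensions, both pooled vectors pass through the same two fully connected layers with a ReLU in between, and the summed outputs go through a sigmoid. The weight arguments `w0` (C/r x C) and `w1` (C x C/r) and all function names are assumptions for illustration; biases are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def shared_mlp(vec, w0, w1):
    """Two fully connected layers with a ReLU after the first one."""
    hidden = [max(0.0, h) for h in matvec(w0, vec)]
    return matvec(w1, hidden)


def channel_attention(feature, w0, w1):
    """Return one attention value in (0, 1) per channel of a C x H x W map."""
    avg = [sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
           for plane in feature]
    mx = [max(max(row) for row in plane) for plane in feature]
    # The same weights process both pooled descriptors.
    from_avg = shared_mlp(avg, w0, w1)
    from_max = shared_mlp(mx, w0, w1)
    return [sigmoid(a + m) for a, m in zip(from_avg, from_max)]
```

Because the hidden layer has only C/r units, the module adds few parameters relative to the convolution blocks it accompanies.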

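In addition, the spatial attention map may be calculated by the spatial attention module using Equation 6 below.

$$M_{s}(F) = \sigma\big(f^{7 \times 7}([AvgPool(F);\, MaxPool(F)])\big) \quad \text{(Equation 6)}$$

Here, $\sigma$ denotes the sigmoid function, $f^{7 \times 7}$ is a convolutional layer with a 7x7 kernel, and $[\,\cdot\,;\,\cdot\,]$ may be a layer that performs concatenation.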
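The spatial attention computation can be sketched in the same spirit: the feature map is average-pooled and max-pooled across channels, the two resulting planes are stacked, a single convolution with a 7x7 kernel is applied with zero padding, and a sigmoid produces one value per pixel. The `kernel` argument (2 x K x K weights), the `bias` argument, and the function name are assumptions for illustration.

```python
import math


def spatial_attention(feature, kernel, bias=0.0):
    """Return an H x W attention map for a C x H x W feature map.

    kernel[0] weights the channel-average plane and kernel[1] the
    channel-max plane; both are K x K (K = 7 in the text).
    """
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    avg = [[sum(feature[ch][y][x] for ch in range(c)) / c for x in range(w)]
           for y in range(h)]
    mx = [[max(feature[ch][y][x] for ch in range(c)) for x in range(w)]
          for y in range(h)]
    pooled = [avg, mx]  # concatenation of the two pooled planes
    k = len(kernel[0])
    pad = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = bias
            for p in range(2):
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < h and 0 <= xx < w:  # zero padding
                            acc += kernel[p][dy][dx] * pooled[p][yy][xx]
            row.append(1.0 / (1.0 + math.exp(-acc)))
        out.append(row)
    return out
```

The large kernel lets each output pixel see a wide neighbourhood, so the map can highlight whole facial regions rather than isolated pixels.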

The attention map 322 generated through the above process may be transferred to the low-resolution face recognition network 330. Here, the low-resolution face recognition network 330 may be a network for performing face recognition 346 using the low-resolution image 340. The low-resolution image 340 may depict the same content and/or shape as the high-resolution image 320, but may have a different resolution. According to one embodiment, another attention map 342 may be extracted from the low-resolution face recognition network 330. In this case, the other attention map 342 may be trained or distilled to be similar to the received attention map 322 and thereby converted into a more precise attention map 344.

Although FIG. 3 has been described in terms of the channel attention map and the spatial attention map being calculated separately, the present invention is not limited thereto, and the channel attention map and the spatial attention map may be generated or calculated simultaneously by a convolution block attention module (CBAM) or the like. With this configuration, even when only a low-resolution image is received because of the low computing power of a driving robot or the like, the low-resolution face recognition network 330 can generate a precise attention map and accordingly recognize the face included in the low-resolution image more accurately. In other words, the low-resolution face recognition network 330 may perform high-performance face recognition using an image taken by a low-resolution image sensor. In addition, since the low-resolution face recognition network 330 can be used to build systems that operate with low-cost IoT sensors on multiple robots and edge devices, hardware costs can be effectively reduced.

FIG. 4 is a diagram illustrating an example of training a high-resolution face recognition network according to an embodiment of the present invention. As described above, the high-resolution face recognition network may be trained to recognize a person's face based on the high-resolution image 420 including the person's face. According to an embodiment, the high-resolution face recognition network includes a plurality of blocks 410 for extracting features of a high-resolution image and attention modules corresponding to the respective blocks 410 (e.g., a channel attention module, a spatial attention module, a convolutional block attention module, etc.). In other words, each block 410 may be associated with an attention module for extracting an attention map, and the attention map corresponding to each block 410 may be extracted by that attention module.

According to an embodiment, a first initial attention map may be extracted from the first block (B1) 410_1 (e.g., the attention module corresponding to the first block) among the plurality of blocks 410_1, 410_2, 410_3, and 410_4, and a second initial attention map may be extracted from the second block (B2) 410_2 connected to the first block 410_1. In this case, the second initial attention map may be learned to be similar to the first initial attention map by using knowledge distillation.

In the illustrated example, the second initial attention map may be learned to be similar to the first initial attention map. In this case, the attention size of the first initial attention map may be larger than that of the second initial attention map by a specific ratio (e.g., two times). Therefore, for knowledge distillation, the first initial attention map may be reduced by the specific ratio using a max pooling layer. Knowledge distillation can then be performed between the first initial attention map and the second initial attention map, which have the same size after this reduction.
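As a non-limiting sketch, the size matching and comparison of two adjacent initial attention maps may look as follows. The ratio of 2 follows the example given above. The cosine distance used here is only an assumed choice, since the description does not fix a particular distance function, and the names `max_pool` and `cosine_distance` as well as the toy values are hypothetical.

```python
import math


def max_pool(attention, ratio):
    """Shrink an H x W attention map by `ratio` with non-overlapping max pooling."""
    height, width = len(attention), len(attention[0])
    return [[max(attention[y * ratio + dy][x * ratio + dx]
                 for dy in range(ratio) for dx in range(ratio))
             for x in range(width // ratio)]
            for y in range(height // ratio)]


def cosine_distance(map_a, map_b):
    """Assumed distance: 1 - cosine similarity of the flattened maps."""
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(a * a for a in flat_a))
    norm_b = math.sqrt(sum(b * b for b in flat_b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)  # epsilon avoids division by zero


# First initial attention map (4x4) and second initial attention map (2x2); toy values.
first_map = [[0.1, 0.2, 0.0, 0.0],
             [0.3, 0.4, 0.0, 0.1],
             [0.0, 0.0, 0.5, 0.6],
             [0.0, 0.2, 0.7, 0.8]]
second_map = [[0.5, 0.1],
              [0.1, 0.7]]

pooled_first = max_pool(first_map, 2)          # now 2x2, same size as second_map
distillation_loss = cosine_distance(pooled_first, second_map)
```

Minimizing such a distance during training would pull the second initial attention map toward the pooled first initial attention map.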

In FIG. 4, the high-resolution face recognition network is illustrated as including four blocks 410 and four attention modules, but the present invention is not limited thereto, and the high-resolution face recognition network may include any number of blocks and attention modules. In addition, although FIG. 4 has been described as generating an initial attention map for one high-resolution image 420 and performing knowledge distillation, the present invention is not limited thereto, and knowledge distillation may be performed for each of a plurality of high-resolution images.

FIG. 5 is a diagram illustrating an example of training a low-resolution face recognition network according to an embodiment of the present invention. As described above, the high-resolution face recognition network can be trained to perform face recognition using high-resolution images. Additionally, the low-resolution face recognition network may be trained to perform face recognition using the low-resolution image 520. When the networks are trained in this way, the first attention map associated with the high-resolution face recognition network and the second attention map associated with the low-resolution face recognition network may be generated.

The first attention map may be transferred from the high-resolution face recognition network (or the plurality of blocks 410 and the attention modules included in the high-resolution face recognition network) to the low-resolution face recognition network (or the plurality of blocks 510 and the attention modules included in the low-resolution face recognition network). Then, the second attention map may be learned to be similar to the first attention map using knowledge distillation. Here, the first attention map may include a plurality of initial attention maps corresponding to the respective blocks 410 of the high-resolution face recognition network, and the second attention map may include a plurality of initial attention maps corresponding to the respective blocks 510 of the low-resolution face recognition network. That is, knowledge distillation may be performed in each block of the network, but is not limited thereto. In this way, the low-resolution face recognition network can be trained efficiently, because what is transferred in the training process is an attention map rather than a feature vector requiring a large capacity.
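To give a rough sense of the capacity difference mentioned above, the following sketch counts the values held by a feature map and by the corresponding attention maps for each block. It assumes that a channel attention map holds one value per channel and a spatial attention map holds one value per spatial location, as in a convolutional block attention module. The four block shapes are hypothetical and are not taken from the described networks.

```python
def feature_values(channels, height, width):
    """Number of values in a C x H x W feature map."""
    return channels * height * width


def attention_values(channels, height, width):
    """Values in a channel attention map (C) plus a spatial attention map (H x W)."""
    return channels + height * width


# Hypothetical (channels, height, width) of the features output by four blocks.
block_shapes = [(64, 56, 56), (128, 28, 28), (256, 14, 14), (512, 7, 7)]

total_feature = sum(feature_values(*shape) for shape in block_shapes)
total_attention = sum(attention_values(*shape) for shape in block_shapes)
reduction = total_feature / total_attention
```

Under these assumed shapes the four feature maps hold 376,320 values in total while the attention maps hold 5,125, roughly a 73-fold difference.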

In FIG. 5, the high-resolution face recognition network and the low-resolution face recognition network are illustrated as each including four blocks and four attention modules, but the present invention is not limited thereto, and each network may include any number of blocks and attention modules. In addition, although FIG. 5 has been described as generating an attention map for one image 420, 520 in each network and performing knowledge distillation, the present invention is not limited thereto, and knowledge distillation may be performed for each of a plurality of images. With this configuration, the attention map extracted from the high-resolution face recognition network and the attention map extracted from the low-resolution face recognition network can be highly correlated, and accordingly face recognition may be performed with high accuracy even when the low-resolution image 520 is used.

FIG. 6 is a flowchart illustrating an example of an attention map transfer method 600 according to an embodiment of the present invention. The attention map transfer method 600 may be performed by a processor (e.g., at least one processor of a computing device). As shown, the attention map transfer method 600 may be initiated by the processor training a high-resolution face recognition network for recognizing a human face based on a high-resolution image including the human face (S610). For example, the processor may extract a first initial attention map from a first block included in a plurality of blocks, extract a second initial attention map from a second block connected to the first block, and train the high-resolution face recognition network so that the second initial attention map becomes similar to the first initial attention map by using knowledge distillation.

The processor may extract a first attention map associated with the high-resolution image from the trained high-resolution face recognition network (S620). In addition, the processor may transfer the extracted first attention map to a low-resolution face recognition network for recognizing a human face based on a low-resolution image including the human face (S630). Here, the low-resolution image may be generated by the processor. For example, the processor may acquire a high-resolution image including a human face and perform downsampling on the acquired high-resolution image. Then, the processor may perform blur processing on the downsampled image and resize the blurred image to a size corresponding to the high-resolution image to generate the low-resolution image.
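The generation of the low-resolution image described above (downsampling, blur processing, then resizing back to the size of the high-resolution image) can be sketched as follows for a grayscale image stored as nested lists. The description does not specify the downsampling method, the blur kernel, or the interpolation, so the block averaging, 3x3 box blur, and nearest-neighbour resizing used here are assumptions, and all function names are hypothetical.

```python
def downsample(image, factor):
    """Assumed downsampling: average each factor x factor block."""
    height, width = len(image), len(image[0])
    return [[sum(image[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(width // factor)]
            for y in range(height // factor)]


def box_blur(image):
    """Assumed blur: 3x3 box filter, averaging only the pixels inside the image."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [image[yy][xx]
                      for yy in range(max(0, y - 1), min(height, y + 2))
                      for xx in range(max(0, x - 1), min(width, x + 2))]
            row.append(sum(values) / len(values))
        out.append(row)
    return out


def resize_nearest(image, out_height, out_width):
    """Assumed resizing: nearest-neighbour interpolation."""
    height, width = len(image), len(image[0])
    return [[image[y * height // out_height][x * width // out_width]
             for x in range(out_width)]
            for y in range(out_height)]


def make_low_resolution(image, factor=2):
    """Downsample, blur, then resize back to the original size."""
    small = downsample(image, factor)
    blurred = box_blur(small)
    return resize_nearest(blurred, len(image), len(image[0]))


# Toy 8x8 "high-resolution" image with a diagonal gradient.
high = [[float(x + y) for x in range(8)] for y in range(8)]
low = make_low_resolution(high, factor=2)
```

The output keeps the 8x8 size of the input but carries only the detail of a blurred 4x4 image, so the two networks can be fed inputs of matching size.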

The processor may train the low-resolution face recognition network using the transferred first attention map (S640). For example, the processor may extract the second attention map from the low-resolution face recognition network and train the low-resolution face recognition network so that the second attention map becomes similar to the first attention map by using knowledge distillation.
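One way to picture the training objective of step S640 is as a sum of a face recognition loss and a distillation loss. The sketch below assumes an ArcFace-style recognition loss in a simplified form (it omits the handling that practical implementations add for angles where the margin would exceed pi) and takes the per-block distillation distances as already computed numbers. The function names, the `weight` parameter, the scale and margin values, and the toy inputs are all hypothetical.

```python
import math


def arcface_loss(cosines, target, scale=64.0, margin=0.5):
    """Simplified ArcFace-style loss for one sample.

    cosines: cosine similarity between the face embedding and each class centre.
    target:  index of the ground-truth identity.
    """
    logits = []
    for index, cosine in enumerate(cosines):
        cosine = max(-1.0, min(1.0, cosine))
        if index == target:
            # Additive angular margin applied to the ground-truth class only.
            cosine = math.cos(math.acos(cosine) + margin)
        logits.append(scale * cosine)
    peak = max(logits)  # subtract the peak for a numerically stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def total_loss(recognition_loss, block_distances, weight=1.0):
    """Recognition loss plus the (optionally weighted) sum of per-block distances."""
    return recognition_loss + weight * sum(block_distances)


# Toy values: three identities, and one attention distance per block (four blocks).
recognition = arcface_loss([0.9, 0.1, -0.2], target=0, scale=8.0)
no_margin = arcface_loss([0.9, 0.1, -0.2], target=0, scale=8.0, margin=0.0)
loss = total_loss(recognition, [0.02, 0.01, 0.03, 0.04])
```

With `weight=1.0` the objective reduces to a plain sum of the two losses; any other weighting would be a design choice outside what is described here.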

FIG. 7 is a block diagram showing an internal configuration of a computing device 700 according to an embodiment of the present invention. The computing device 700 may include a memory 710, a processor 720, a communication module 730, and an input/output interface 740. As shown in FIG. 7, the computing device 700 may be configured to communicate information and/or data over a network using the communication module 730.

The memory 710 may include any non-transitory computer-readable storage medium. According to one embodiment, the memory 710 may include random access memory (RAM), read only memory (ROM), and a permanent mass storage device such as a disk drive, a solid state drive (SSD), or flash memory. As another example, a permanent mass storage device such as a ROM, SSD, flash memory, or disk drive may be included in the computing device 700 as a permanent storage device separate from the memory. Also, an operating system and at least one program code may be stored in the memory 710.

These software components may be loaded from a computer-readable recording medium separate from the memory 710. Such a separate computer-readable recording medium may include a recording medium directly connectable to the computing device 700, for example, a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, or the like. As another example, software components may be loaded into the memory 710 through the communication module 730 rather than from a computer-readable recording medium. For example, at least one program may be loaded into the memory 710 based on a computer program installed from files that developers, or a file distribution system for distributing application installation files, provide through the communication module 730.

The processor 720 may be configured to process commands of a computer program by performing basic arithmetic, logic, and input/output operations. Commands may be provided to a user terminal (not shown) or other external system by the memory 710 or the communication module 730 .

The communication module 730 may provide a configuration or function for a user terminal (not shown) and the computing device 700 to communicate with each other through a network, and may provide a configuration or function for the computing device 700 to communicate with an external system (for example, a separate cloud system). For example, control signals, commands, data, etc. provided under the control of the processor 720 of the computing device 700 may be transmitted through the communication module 730 and the network, and received by the user terminal and/or the external system through the communication module of the user terminal and/or the external system.

Also, the input/output interface 740 of the computing device 700 may be a means for interfacing with a device (not shown) for input or output that may be connected to, or included in, the computing device 700. In FIG. 7, the input/output interface 740 is illustrated as an element configured separately from the processor 720, but the present invention is not limited thereto, and the input/output interface 740 may be included in the processor 720. The computing device 700 may include more components than those shown in FIG. 7. However, most conventional components need not be explicitly illustrated.

The processor 720 of the computing device 700 may be configured to manage, process, and/or store information and/or data received from a plurality of user terminals and/or a plurality of external systems.

The above-described methods and/or various embodiments may be realized with digital electronic circuits, computer hardware, firmware, software, and/or combinations thereof. Various embodiments of the present invention may be executed by a data processing device, e.g., one or more programmable processors and/or one or more computing devices, or may be implemented as a computer-readable recording medium and/or a computer program stored on a computer-readable recording medium. The above-described computer programs may be written in any form of programming language, including compiled or interpreted languages, and may be distributed in any form, such as a stand-alone program, module, or subroutine. A computer program may be distributed over one computing device, over multiple computing devices connected through the same network, and/or over multiple computing devices connected through multiple different networks.

The methods and/or various embodiments described above may be performed by one or more processors configured to execute one or more computer programs that process, store, and/or manage any function or the like by operating on input data or generating output data. For example, the methods and/or various embodiments of the present invention may be performed by a special purpose logic circuit such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC), and apparatuses and/or systems for performing the methods and/or embodiments of the present invention may be implemented as special purpose logic circuits such as FPGAs or ASICs.

The one or more processors executing the computer program may include a general purpose or special purpose microprocessor and/or one or more processors of any kind of digital computing device. The processor may receive instructions and/or data from a read-only memory, from a random access memory, or from both. In the present invention, components of a computing device performing methods and/or embodiments may include one or more processors for executing instructions, and one or more memory devices for storing instructions and/or data.

According to one embodiment, a computing device may exchange data with one or more mass storage devices for storing data. For example, a computing device may receive data from, and/or transfer data to, a magnetic or optical disc. A computer-readable storage medium suitable for storing instructions and/or data associated with a computer program may include any type of non-volatile memory, including semiconductor memory devices such as Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable PROM (EEPROM), and flash memory devices, but is not limited thereto. For example, computer-readable storage media may include magnetic disks such as internal hard disks or removable disks, magneto-optical disks, and CD-ROM and DVD-ROM disks.

To provide interaction with a user, a computing device may include a display device (e.g., a cathode ray tube (CRT), a liquid crystal display (LCD), etc.) and an input device (e.g., a keyboard, mouse, trackball, etc.) capable of providing input and/or commands to the computing device, but is not limited thereto. That is, the computing device may further include any other type of device for providing interaction with a user. For example, a computing device may provide any form of sensory feedback to a user for interaction with the user, including visual feedback, auditory feedback, and/or tactile feedback. In this regard, the user may provide input to the computing device through various modalities such as vision, voice, and motion.

In the present invention, various embodiments may be implemented in a computing system including a back-end component (eg, a data server), a middleware component (eg, an application server), and/or a front-end component. In this case, the components may be interconnected by any form or medium of digital data communication, such as a communication network. For example, the communication network may include a local area network (LAN), a wide area network (WAN), and the like.

A computing device based on the example embodiments described herein may be implemented using hardware and/or software configured to interact with a user, including a user device, a user interface (UI) device, a user terminal, or a client device. For example, the computing device may include a portable computing device such as a laptop computer. Additionally or alternatively, the computing device may include, but is not limited to, a personal digital assistant (PDA), a tablet PC, a game console, a wearable device, an internet of things (IoT) device, a virtual reality (VR) device, an augmented reality (AR) device, and the like. A computing device may further include other types of devices configured to interact with a user. Further, the computing device may include a portable communication device (e.g., a mobile phone, smart phone, wireless cellular phone, etc.) suitable for wireless communication over a network, such as a mobile communication network. A computing device may be configured to communicate wirelessly with a network server using wireless communication technologies and/or protocols such as radio frequency (RF), microwave frequency (MWF), and/or infrared ray frequency (IRF).

The various embodiments herein, including specific structural and functional details, are exemplary. Accordingly, embodiments of the present invention are not limited to those described above and may be implemented in various other forms. In addition, terms used in the present invention are for describing some embodiments and are not to be construed as limiting the embodiments. For example, terms in the singular form and the term "said" may be construed to include the plural as well, unless the context clearly dictates otherwise.

In the present invention, unless defined otherwise, all terms used in this specification, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which such concept belongs. In addition, terms commonly used, such as terms defined in a dictionary, should be interpreted as having a meaning consistent with the meaning in the context of the related technology.

Although the present invention has been described in relation to some embodiments in this specification, various modifications and changes can be made without departing from the scope of the present invention that can be understood by those skilled in the art. Moreover, such modifications and variations are intended to fall within the scope of the claims appended hereto.

Claims

1. An attention map transferring method for improving face recognition performance of a low-resolution image, the method being performed by at least one processor and comprising:

training a high-resolution face recognition network for recognizing a human face based on a high-resolution image including the human face;

extracting a first attention map associated with the high-resolution image from the trained high-resolution face recognition network;

transferring the extracted first attention map to a low-resolution face recognition network for recognizing the human face based on a low-resolution image including the human face; and

training the low-resolution face recognition network using the transferred first attention map.
2. The method of claim 1, wherein the training of the low-resolution face recognition network comprises:

extracting a second attention map from the low-resolution face recognition network; and

training the low-resolution face recognition network so that the second attention map becomes similar to the first attention map using knowledge distillation.
3. The method of claim 2, wherein the training of the low-resolution face recognition network so that the second attention map becomes similar to the first attention map comprises:

training the low-resolution face recognition network using a sum of a face recognition loss and a distillation loss in the low-resolution face recognition network.
4. The method of claim 1, wherein the high-resolution face recognition network includes a plurality of sequentially connected blocks, and

wherein the training of the high-resolution face recognition network comprises:

extracting a first initial attention map from a first block included in the plurality of blocks;

extracting a second initial attention map from a second block connected to the first block; and

training the high-resolution face recognition network so that the second initial attention map becomes similar to the first initial attention map using knowledge distillation.
5. The method of claim 4, wherein the training of the high-resolution face recognition network so that the second initial attention map becomes similar to the first initial attention map comprises:

training the high-resolution face recognition network using a loss that is the sum of the ArcFace loss and the distillation loss in the high-resolution face recognition network,

wherein the distillation loss is expressed in terms of the spatial attention value of the i-th block of the high-resolution face recognition network, a distance function for the distillation loss, and a max pooling layer.
6. The method of claim 1, further comprising:

obtaining a high-resolution image including the human face;

performing downsampling on the obtained high-resolution image;

performing blur processing on the downsampled image; and

generating the low-resolution image by resizing the blurred image to a size corresponding to the high-resolution image.
7. The method of claim 1, wherein the first attention map includes a channel attention map indicating a channel referenced at or above a specific criterion for face recognition and a spatial attention map indicating a feature region referenced at or above another specific criterion for face recognition.
8. The method of claim 1, wherein the high-resolution face recognition network includes a plurality of blocks for extracting features of the high-resolution image and a plurality of attention modules for extracting the first attention map.
9. A computer program stored in a computer-readable recording medium to execute, on a computer, the method according to any one of claims 1 to 8.