CN113239844B - Intelligent cosmetic mirror system based on multi-head attention target detection - Google Patents
- Publication number
- CN113239844B (application number CN202110576729.3A)
- Authority
- CN
- China
- Prior art keywords
- target detection
- control unit
- makeup
- head attention
- generator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- A—HUMAN NECESSITIES
- A45—HAND OR TRAVELLING ARTICLES
- A45D—HAIRDRESSING OR SHAVING EQUIPMENT; EQUIPMENT FOR COSMETICS OR COSMETIC TREATMENTS, e.g. FOR MANICURING OR PEDICURING
- A45D42/00—Hand, pocket, or shaving mirrors
- A45D42/08—Shaving mirrors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0631—Item recommendations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0641—Shopping interfaces
- G06Q30/0643—Graphical representation of items or shoppers
Abstract
The invention discloses an intelligent cosmetic mirror system based on multi-head attention target detection, comprising an image acquisition unit, a target detection unit and a control unit. The image acquisition unit acquires a facial image of the user; the target detection unit extracts features from the user image, detects the different face regions, evaluates the makeup of each region and renders it in high definition; the control unit combines the information fed back by the target detection unit to generate a preview of the current makeup on the cosmetic mirror. On the basis of the traditional cosmetic mirror, the system adds deep-learning techniques from artificial intelligence such as target detection, a multi-head attention mechanism and a generative adversarial network: the target detection technology identifies the different parts of the face, the makeup information of each part is compared with a cloud database to score the current makeup, specific parts are rendered in high definition, and the final makeup effect is predicted, so that the user can directly see the try-on effect and save time.
Description
Technical Field
The invention relates to the technical field of cosmetic mirrors, in particular to an intelligent cosmetic mirror system based on multi-head attention target detection.
Background
The cosmetic mirror is a daily necessity for many users, but the traditional cosmetic mirror has a single function, offering only basic imaging or supplementary lighting. A user who merely wants to try a look must apply the full makeup on the face and then wipe it off, which easily damages the skin and wastes a large amount of cosmetics and time. Moreover, while making up, the user cannot tell whether the current makeup will achieve the expected effect, or which makeup suits her best. How to improve the usefulness of the cosmetic mirror is therefore a problem to be solved urgently.
The intelligent cosmetic mirror system designed here, based on multi-head attention target detection and a generative adversarial network, solves these problems and effectively improves the usefulness of the cosmetic mirror.
Disclosure of Invention
The invention provides an intelligent cosmetic mirror system based on multi-head attention target detection, which aims to overcome the defects in the prior art.
The intelligent cosmetic mirror system consists of an image acquisition unit, a target detection unit and a control unit.
The image acquisition unit acquires the face image of the user under the control of the control unit through an external camera.
The target detection unit, under the control of the control unit, extracts face features from the acquired face images with a residual network, identifies the different face regions with an encoding neural network based on a multi-head attention mechanism, and obtains the final makeup-effect image with a generative adversarial network.
The different face regions are identified with the encoding neural network based on the multi-head attention mechanism as follows. The face feature map output by the residual network is combined, via a Hadamard product, with the encoding information of the different face regions, where this encoding information is a learnable, randomly initialized tensor with the same dimensions as the face feature map. The Hadamard product result is input into a sequence embedding layer, and the resulting sequence is fed into the multi-head attention encoding neural network. Prediction boxes for the different parts of the face are then predicted by separate fully connected layers. The final prediction-box information is input into the control unit, which marks the different parts of the face on the cosmetic mirror according to the boxes, captures the makeup information of the different parts, and transmits it to the cloud server for makeup-effect evaluation.
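The region encoding and sequence embedding step above can be sketched in a few lines of numpy. This is an illustrative sketch, not the patented implementation: the shapes (a 7 × 7 × 2048 feature map flattened into 49 tokens of dimension 2048) follow the description, while the random values merely stand in for the learned tensors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Backbone output and the learnable region encoding of identical shape
# (here both are random stand-ins for the trained tensors).
feature_map = rng.standard_normal((7, 7, 2048))
region_encoding = rng.standard_normal((7, 7, 2048))

# Hadamard (element-wise) product, then flattening into a token sequence
# for the sequence embedding layer: 7 x 7 positions -> 49 tokens.
hadamard = feature_map * region_encoding
sequence = hadamard.reshape(49, 2048)

print(sequence.shape)  # (49, 2048)
```

Each of the 49 rows then serves as one element of the sequence fed into the multi-head attention layer.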
The final makeup-effect image is obtained with a generative adversarial network consisting of a generator and a discriminator. The input of the generator is the feature map finally output by the residual network, and its output is a predicted image of the face after makeup. The input of the discriminator is the makeup-effect image output by the generator; when the confidence the discriminator computes for the generated image is greater than a threshold, the image is output to the control unit, which displays it on the cosmetic mirror; otherwise the discriminator returns the image to the generator, which is required to regenerate it until the confidence of the generated image is greater than the threshold.
The control unit comprises a wireless communication module, a voice input module, a voice output module, a clock module, a storage module, an arithmetic logic operation module and a microprogram conversion module.
The wireless communication module of the control unit establishes connections between the control unit and the cloud server, mobile terminals and the mobile network, realizing data and control interaction between the control unit and the corresponding parts.
The voice input module contained in the control unit receives the voice control command of the user and completes the execution and response of the corresponding command.
The voice output module contained in the control unit converts the internal control signal into voice information and outputs the voice information to the loudspeaker to prompt the user of the related information.
The control unit comprises a clock module which generates periodic pulse signals so that the components in the control unit can execute commands in order.
The control unit comprises a storage module used for storing data generated in the running of the microprogram, an execution result of the control command, a result generated by the target detection unit and input data of each network for target detection.
And the arithmetic logic operation module of the control unit performs corresponding arithmetic operation and logic operation.
The microprogram conversion module contained in the control unit is used for converting other programs into microprograms executable by the control unit.
The intelligent cosmetic mirror system based on multi-head attention target detection operates in the following steps:
S1, extracting the face features with the residual network in the target detection unit;
S2, inputting the face features extracted by the residual network into the encoding neural network based on the multi-head attention mechanism to obtain prediction boxes for the different face regions;
S3, inputting the face features extracted by the residual network into the generative adversarial network to obtain a makeup-effect image.
Preferably, in step S1 the face features are extracted with the residual network of the target detection unit. The residual network contains 3 first residual modules, 4 second residual modules, 6 third residual modules and 3 fourth residual modules, with residual connections between the modules. The first residual module contains a 1 × 1 × 64 convolutional layer, a 3 × 3 × 64 convolutional layer and a 1 × 1 × 256 convolutional layer; the second residual module contains a 1 × 1 × 128, a 3 × 3 × 128 and a 1 × 1 × 512 convolutional layer; the third residual module contains a 1 × 1 × 256, a 3 × 3 × 256 and a 1 × 1 × 1024 convolutional layer; the fourth residual module contains a 1 × 1 × 512, a 3 × 3 × 512 and a 1 × 1 × 2048 convolutional layer. The residual network finally outputs a 7 × 7 × 2048 feature map. The face features are extracted from the face image with the residual network as follows:
(1) The feature information of the face image is extracted through the convolutional layers of the residual network, where the feature extraction of the l-th convolutional layer is

$$Y_l = f\left(W_l \ast X_{l-1} + b_l\right)$$

where $W_l \in R^{c \times w \times h}$ is a learnable three-dimensional convolution kernel of the residual network, $\ast$ denotes the convolution operation, $X_{l-1}$ is the output of the (l-1)-th convolutional layer serving as the input of the l-th layer (when l = 1, $X_0$ is the face image data acquired by the image acquisition unit), $b_l \in R^{c \times w \times h}$ is a randomly initialized convolution bias, $f(\cdot)$ is the ReLU activation function, and $Y_l$ is the face feature output by the l-th convolutional layer. The ReLU activation function is

$$f(x) = \max(0, x)$$
(2) The face features output by different residual modules are combined through residual connections to enhance the feature extraction capability of the residual network. The residual connection of the l-th residual module is

$$\tilde{Y}_l = Y_l \oplus X_l$$

where $X_l$ is the input of the l-th residual module, $Y_l$ is its output, $\oplus$ denotes element-wise matrix addition, and $\tilde{Y}_l$ is the final output of the l-th residual module after the residual connection.
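The ReLU activation and the residual connection above can be illustrated with a toy numpy sketch. The lambda standing in for the convolutional stack of a residual module is purely illustrative; in the patent that transform is the 1 × 1 / 3 × 3 / 1 × 1 convolution sequence.

```python
import numpy as np

def relu(x):
    # ReLU activation: f(x) = max(0, x), applied element-wise.
    return np.maximum(0.0, x)

def residual_block(x, transform):
    # Residual connection: output = transform(x) + x (element-wise addition).
    return transform(x) + x

x = np.array([-1.0, 0.0, 2.0])
assert np.array_equal(relu(x), np.array([0.0, 0.0, 2.0]))

# A trivial transform stands in here for the module's convolution stack.
y = residual_block(x, relu)
print(y)  # [-1.  0.  4.]
```

Note how the residual connection lets the negative input component pass through unchanged even though ReLU zeroed it inside the transform.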
Preferably, in step S2 the face features extracted by the residual network are input into the encoding neural network based on the multi-head attention mechanism to obtain prediction boxes for the different face regions. This network comprises a sequence embedding layer, a multi-head attention layer, a network layer regularizing the multi-head attention output, a feedforward neural network layer, and a network layer regularizing the feedforward network output. Its input is the 7 × 7 × 2048 feature map finally output by the residual network; the sequence embedding layer converts the feature map into 49 sequence elements of dimension 1 × 2048, which are fed into the multi-head attention layer. The multi-head attention layer captures the face-region features in the sequence data, which pass through a regularization layer into the feedforward neural network; after regularization, the final sequence data describing the face-region features are obtained. Using the face-region prediction boxes generated by the target detection unit, the makeup information of the different parts of the face image is transmitted to the cloud server through the wireless communication module of the control unit. The cloud server compares this information with the related makeup information in the cloud database and returns a score for the user's current makeup, which the control unit feeds back to the user on the cosmetic mirror. If the makeup score is too low, the server recommends corresponding cosmetics to the user according to the current face region. Wherein:
(1) The final face feature map output by the residual network is input into the multi-head attention encoding neural network, whose multi-head attention is computed as

$$att((K_i, V_i), q_i) = softmax\left(v^{T} \tanh(K_i \oplus q_i)\right) V_i$$
$$att((K, V), Q) = \bigoplus_{i=1}^{h} att((K_i, V_i), q_i)$$

where $K_i$, $V_i$, $q_i$ are the key matrix, value matrix and query vector obtained by projecting the input face features into the i-th feature space; $W_k$, $W_v$, $W_q$, $v^T$ are learnable projection matrices for the keys, values, query vectors and the activation vector; $att((K_i, V_i), q_i)$ denotes the attention score of the i-th head; $att((K, V), Q)$ denotes the final multi-head attention score; $h$ is the number of attention heads; $\oplus$ denotes element-wise matrix addition; and $\tanh(\cdot)$, $softmax(\cdot)$ are activation functions.
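A minimal numpy sketch of one additive attention head of the form described above, with the heads combined by element-wise addition. All dimensions and random tensors are toy stand-ins; the real network would use the learned projections $W_k$, $W_v$, $W_q$ of the face-feature sequence.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a score vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_head(K, V, q, v):
    # One head: score_j = v^T tanh(k_j + q), weights = softmax(scores),
    # output = weighted sum of the value rows.
    scores = np.tanh(K + q) @ v
    return softmax(scores) @ V

rng = np.random.default_rng(1)
n, d, h = 5, 8, 2   # toy sizes: tokens, feature dim, number of heads

heads = [additive_head(rng.standard_normal((n, d)),   # K_i
                       rng.standard_normal((n, d)),   # V_i
                       rng.standard_normal(d),        # q_i
                       rng.standard_normal(d))        # v
         for _ in range(h)]
out = np.sum(heads, axis=0)   # heads combined by element-wise addition
print(out.shape)  # (8,)
```

The softmax guarantees the per-head attention weights sum to one, so each head's output is a convex combination of its value rows.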
(2) The loss function for face-region target detection is defined as

$$L(Y, \hat{Y}) = \sum_{i=1}^{N}\left[-\log \hat{p}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, L_{box}(b_i, \hat{b}_i)\right]$$

where $Y$ contains the true face-region boxes and $\hat{Y}$ the predictions of the multi-head attention encoding neural network; $c_i$ denotes the i-th face region and $\hat{p}(c_i)$ its predicted class score; $\mathbb{1}_{\{c_i \neq \varnothing\}}$ is the indicator that the i-th face region is non-empty; $b_i$ is the true value of the i-th face-region box and $\hat{b}_i$ the value of the i-th region's prediction box; and $L_{box}(\cdot)$ is a commonly used prediction-box loss such as the MAE. The encoding neural network based on the multi-head attention mechanism is trained with this face-region target detection loss and a gradient descent algorithm until the network converges.
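A toy numpy sketch of a loss of this form, with MAE as the box term. The class probabilities, box coordinates and the index of the "empty" class are illustrative values, not from the patent.

```python
import numpy as np

def detection_loss(class_probs, labels, boxes_true, boxes_pred, empty_class):
    # Sum over regions: classification negative log-likelihood, plus an
    # MAE box term applied only when the region is non-empty.
    loss = 0.0
    for p, c, b, b_hat in zip(class_probs, labels, boxes_true, boxes_pred):
        loss += -np.log(p[c])
        if c != empty_class:
            loss += np.abs(b - b_hat).mean()
    return loss

probs = np.array([[0.7, 0.2, 0.1],     # region 1: true class 0
                  [0.1, 0.1, 0.8]])    # region 2: "empty" (class 2)
labels = [0, 2]
bt = np.array([[0.2, 0.2, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0]])
bp = np.array([[0.25, 0.2, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0]])

print(round(detection_loss(probs, labels, bt, bp, empty_class=2), 4))  # 0.5923
```

The empty region contributes only its classification term, matching the indicator in the formula.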
Preferably, in step S3 the face features extracted by the residual network are input into the generative adversarial network, and the makeup-effect image is obtained as follows:
(1) The face feature map finally output by the residual network is input into the generator of the generative adversarial network, which generates a predicted image of the final effect of the user's current makeup through several different convolutional and pooling layers;
(2) The predicted image of the final effect of the user's current makeup generated by the generator is input into the discriminator, which extracts the image features with several convolutional and pooling layers and sends them to the control unit. The control unit queries the cloud database for the makeup-effect image that best matches the current makeup features and feeds this image back to the discriminator. The discriminator computes the similarity between the generated prediction and the fed-back image with a pixel-wise inner product, and obtains a confidence for the generated prediction through a cross-entropy loss. If the confidence is greater than a preset threshold, the predicted image is input into the control unit, which draws it on the cosmetic mirror; if it is below the threshold, the prediction is returned to the generator, which regenerates it until the confidence of the predicted image exceeds the preset threshold.
The invention has the beneficial effects that: compared with other target detection methods, the disclosed method predicts the target boxes directly from the feature information of the image, dispensing with the manual anchor-box design of traditional target detection methods and realizing truly end-to-end target detection. The target detection technology helps the user understand the current makeup more quickly, saving the user considerable time and improving the user experience.
Drawings
Fig. 1 is a schematic diagram of the residual network used by the target detection unit for face feature extraction in the present invention.
Fig. 2 is a schematic diagram of the target detection network based on the multi-head attention mechanism used by the target detection unit in the present invention.
Fig. 3 is a schematic diagram of the generator in the generative adversarial network of the target detection unit used in the present invention.
Fig. 4 is a schematic diagram of the discriminator in the generative adversarial network of the target detection unit used in the present invention.
Description of the reference numerals: 1-face image data; 2-first residual module; 3-a second residual module; 4-a third residual module; 5-a fourth residual module; 6-sequence embedding layer; 7-multi-head attention target detection; 8-target prediction box; 9-a fifth residual module; 10-a sixth residual module; 11-seventh residual module; 12-an eighth residual module; 13-first average pooling layer; 14-a ninth residual module; 15-a tenth residual module; 16-an eleventh residual module; 17-a twelfth residual module; 18-second average pooling layer.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and to the detailed description of an embodiment of the invention. It is to be understood that the described embodiments are merely exemplary of some, and not all, embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic diagram of the residual network used by the target detection unit for face feature extraction in the present invention.
As shown in fig. 1, the method for extracting features of a face image by using a residual error network comprises the following steps:
(1) Inputting a plurality of images obtained by an image acquisition unit into a residual error network as a batch;
(2) Features are first extracted from the image by the first residual module 2, with residual connections between the residual sub-networks; through several 1 × 1 and 3 × 3 convolution kernels, the first residual module 2 finally outputs a 56 × 56 × 256 face feature map;
(3) The 56 × 56 × 256 face feature maps are input into the second residual module 3, with residual connections between the residual sub-networks; through several 1 × 1 and 3 × 3 convolution kernels, the second residual module 3 finally outputs a 28 × 28 × 512 face feature map;
(4) The 28 × 28 × 512 face feature maps are input into the third residual module 4, with residual connections between the residual sub-networks; through several 1 × 1 and 3 × 3 convolution kernels, the third residual module 4 finally outputs a 14 × 14 × 1024 face feature map;
(5) The 14 × 14 × 1024 face feature maps are input into the fourth residual module 5, with residual connections between the residual sub-networks; through several 1 × 1 and 3 × 3 convolution kernels, the fourth residual module 5 finally outputs a 7 × 7 × 2048 face feature map.
Fig. 2 is a schematic diagram of the target detection network based on the multi-head attention mechanism used by the target detection unit in the present invention.
As shown in fig. 2, the method for identifying different regions of a human face by using a multi-head attention target detection network comprises the following steps:
(1) The face feature map is multiplied element-wise (Hadamard product) with the encoding information of the different face regions; the result is input into the sequence embedding layer 6, the output of the sequence embedding layer 6 is fed into the multi-head attention encoding neural network 7, and there it is projected into different subspaces by the key, value and query projection matrices;
(2) The attention scores of the different heads are computed with a scaled dot product;
(3) The attention scores are normalized with the softmax(·) activation function;
(4) The attention scores of the different heads are normalized and regularized, and the final sequence result 8 is input into the feedforward neural network;
(5) Prediction boxes for the different parts of the face are predicted by separate fully connected layers; the final prediction-box information is input into the control unit, which marks the different parts of the face on the cosmetic mirror according to the boxes, captures the makeup information of the different parts, and transmits it to the cloud server for makeup-effect evaluation.
Fig. 3 is a schematic diagram of the generator in the generative adversarial network of the target detection unit used in the present invention.
Fig. 4 is a schematic diagram of the discriminator in the generative adversarial network of the target detection unit used in the present invention.
As shown in fig. 3 and 4, the method for predicting the final makeup effect of the face with the generator and discriminator of the generative adversarial network comprises the following steps:
(1) The face feature map finally output by the residual network is input into the generator of the generative adversarial network, which generates a predicted image of the final effect of the user's current makeup through several different 3 × 3 convolutional layers 9, 10, 11, 12 and a 2 × 2 pooling layer 13;
(2) The predicted image of the final effect of the user's current makeup generated by the generator is input into the discriminator, which extracts the image features with several 3 × 3 convolutional layers 14, 15, 16, 17 and 2 × 2 pooling layers 18 and sends them to the control unit. The control unit queries the cloud database for the makeup-effect image that best matches the current makeup features and feeds it back to the discriminator. The discriminator computes the similarity between the generated prediction and the fed-back image with a pixel-wise inner product, and obtains a confidence for the prediction through a cross-entropy loss. If the confidence is greater than the preset threshold of 0.8, the predicted image is input into the control unit, which draws it on the cosmetic mirror; if it is below the threshold, the prediction is returned to the generator, which regenerates it until the confidence exceeds the preset threshold.
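The accept-or-regenerate loop above can be sketched with numpy. This is a schematic stand-in only: `generate` produces random images in place of the GAN generator, the reference image stands in for the cloud-matched makeup-effect image, and a sigmoid of the pixel-wise inner product stands in for the cross-entropy-derived confidence.

```python
import numpy as np

def confidence(pred_img, ref_img):
    # Pixel-wise inner product as a similarity score, squashed to (0, 1)
    # as a stand-in for the cross-entropy-based confidence.
    sim = float(np.sum(pred_img * ref_img))
    return 1.0 / (1.0 + np.exp(-sim))

def generate(seed):
    # Placeholder for the GAN generator's predicted makeup-effect image.
    return np.random.default_rng(seed).random((4, 4))

reference = np.full((4, 4), 0.5)   # stand-in for the cloud-matched image
threshold = 0.8

seed = 0
img = generate(seed)
while confidence(img, reference) <= threshold:   # regenerate until confident
    seed += 1
    img = generate(seed)

print(confidence(img, reference) > threshold)  # True
```

The loop exits only once the confidence exceeds the threshold, which is exactly the acceptance condition sent to the control unit.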
Examples
To verify the effectiveness of multi-head attention target detection, a comparison was made on the well-known COCO target detection data set against one of the current best target detection methods, the Faster R-CNN network. Here Faster RCNN-FPN denotes a Faster R-CNN network with a feature pyramid network and region proposals, and Faster RCNN-R101-FPN denotes a Faster R-CNN network with ResNet101 as the backbone. The results are shown in Table 1, where AP denotes average precision, AP-50 the AP at an IoU threshold of 0.5, AP-75 the AP at an IoU threshold of 0.75, AP-S the AP for target boxes with pixel area below 32 × 32, AP-M the AP for areas between 32 × 32 and 96 × 96, and AP-L the AP for areas above 96 × 96.
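The IoU (intersection-over-union) thresholds behind AP-50 and AP-75 can be computed with a short helper; the boxes here are illustrative.

```python
def iou(a, b):
    # Intersection-over-union for boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

# Two unit-offset 2x2 boxes overlap in a single unit square: IoU = 1/7.
print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # 0.1429
```

A detection counts as correct for AP-50 when its IoU with the ground-truth box is at least 0.5, and for AP-75 when it is at least 0.75.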
Table 1 test results of validity experiments of the present invention for multi-head attention target detection.
Claims (8)
1. An intelligent cosmetic mirror system based on multi-head attention target detection, characterized by comprising: an image acquisition unit, a target detection unit and a control unit; the image acquisition unit acquires a face image of the user through an external camera under the control of the control unit; the target detection unit, under the control of the control unit, extracts face features from the acquired face image with a residual network, and the extracted features are input respectively into an encoding neural network based on a multi-head attention mechanism and into the generator of a generative adversarial network; the face feature map is combined via a Hadamard product with the encoding information of the different face regions, the result is input into a sequence embedding layer, and the output is fed into the multi-head attention encoding neural network; prediction boxes for the different parts of the face are predicted by separate fully connected layers, the final prediction-box information is input into the control unit, the control unit marks the different parts of the face on the cosmetic mirror according to the prediction boxes, captures the makeup information of the different parts, and transmits it to the cloud server for makeup-effect evaluation; the feature map finally output by the residual network serves as the input of the generator, which outputs a predicted image of the face after makeup; the input of the discriminator is the makeup-effect image output by the generator; when the confidence computed by the discriminator for the generated image is greater than a threshold, the image is output to the control unit, which displays it on the cosmetic mirror; otherwise the discriminator returns the image to the generator, which regenerates it until the confidence computed by the discriminator is greater than the threshold; the result of the target detection unit is displayed on the cosmetic mirror and presented to the user under the control of the control unit; the control unit comprises a wireless communication module, a voice input module, a voice output module, a clock module, a storage module, an arithmetic logic operation module and a microprogram conversion module.
2. The intelligent cosmetic mirror system based on multi-head attention target detection as claimed in claim 1, wherein: the residual network of the target detection unit comprises 3 first residual modules, 4 second residual modules, 6 third residual modules and 3 fourth residual modules, with residual connections between the modules; the first residual module contains a 1 × 1 × 64, a 3 × 3 × 64 and a 1 × 1 × 256 convolutional layer; the second residual module contains a 1 × 1 × 128, a 3 × 3 × 128 and a 1 × 1 × 512 convolutional layer; the third residual module contains a 1 × 1 × 256, a 3 × 3 × 256 and a 1 × 1 × 1024 convolutional layer; the fourth residual module contains a 1 × 1 × 512, a 3 × 3 × 512 and a 1 × 1 × 2048 convolutional layer; the residual network finally outputs a 7 × 7 × 2048 feature map.
3. The intelligent cosmetic mirror system based on multi-head attention target detection as claimed in claim 1, wherein: the encoding neural network based on the multi-head attention mechanism of the target detection unit comprises a sequence embedding layer, a multi-head attention layer, a network layer regularizing the multi-head attention output, a feedforward neural network layer and a network layer regularizing the feedforward network output; its input is the 7 × 7 × 2048 feature map finally output by the residual network, which the sequence embedding layer converts into 49 sequence elements of dimension 1 × 2048 and feeds into the multi-head attention layer; the multi-head attention layer captures the face-region features in the sequence data, which pass through a regularization layer into the feedforward neural network; after regularization, the final sequence data describing the face-region features are obtained.
4. The intelligent cosmetic mirror system based on multi-head attention target detection according to claim 1, characterized in that: the generative adversarial network of the target detection unit comprises a generator and a discriminator, the generator outputting data of a specified type from random input data, and the discriminator judging the generator's output against real data; the input of the generator is the 7×7×2048 feature map finally output by the residual network, and the generator outputs a predicted face-makeup effect image; the input of the discriminator is the makeup effect image output by the generator; when the confidence computed by the discriminator for the generated makeup effect image exceeds the threshold, the image is output to the control unit, which displays it on the cosmetic mirror; otherwise the discriminator feeds the image back to the generator and requires it to regenerate the makeup effect image, until the confidence of the generated image exceeds the threshold.
5. The intelligent cosmetic mirror system based on multi-head attention target detection as claimed in claim 1, wherein: the wireless communication module contained in the control unit establishes connections between the control unit and the cloud server, the mobile terminal and the mobile network, realizing data interaction and control interaction between the control unit and the corresponding components.
6. The intelligent cosmetic mirror system based on multi-head attention target detection according to claim 1, characterized in that: the voice input module contained in the control unit receives the user's voice control commands and executes and responds to the corresponding commands.
7. The intelligent cosmetic mirror system based on multi-head attention target detection according to claim 1, characterized in that: the voice output module contained in the control unit converts internal control signals into voice information and outputs it to the loudspeaker to prompt the user with relevant information.
8. The intelligent cosmetic mirror system based on multi-head attention target detection according to claim 1, characterized in that: the clock module contained in the control unit generates periodic pulse signals so that the components in the control unit execute commands in an orderly manner; the storage module contained in the control unit stores data generated during microprogram execution, the execution results of control commands, the results produced by the target detection unit, and the input data of each target detection network; the arithmetic logic module of the control unit performs the corresponding arithmetic and logic operations; the microprogram conversion module contained in the control unit converts other programs into microprograms executable by the control unit.
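The stage layout in claim 2 (3/4/6/3 bottleneck modules with 1×1–3×3–1×1 convolutions) matches a ResNet-50-style backbone. The sketch below is hypothetical and traces only the shape arithmetic: the 224×224 input size, the stem convolution/pooling, and the per-stage stride placement are assumptions not stated in the claim, chosen so that the network yields the claimed 7×7×2048 feature map.

```python
# Hypothetical shape trace of the claim-2 residual network (ResNet-50-style).
# Block counts and channel widths come from claim 2; the 224x224 input,
# the stem conv/pool, and the per-stage strides are illustrative assumptions.

def feature_map_shape(input_size=224):
    # Stage layout: (number of bottleneck modules, output channels per claim 2)
    stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]
    size = input_size // 2   # stem: 7x7 conv, stride 2  -> 112
    size = size // 2         # 3x3 max pool, stride 2    -> 56
    channels = 64
    for i, (num_blocks, out_channels) in enumerate(stages):
        if i > 0:            # stages 2-4 each downsample once at their first block
            size = size // 2
        channels = out_channels
    return size, size, channels

print(feature_map_shape())  # -> (7, 7, 2048), the map fed to the encoder
```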
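Claim 3 describes a standard Transformer-style encoder layer: the 7×7×2048 map is flattened into 49 tokens of dimension 2048, passed through multi-head attention, a normalization layer, a feedforward network, and a second normalization. A minimal NumPy sketch follows; the head count (8), the feedforward width (512), and the use of LayerNorm for the "regularization" layers are assumptions, since the claim does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head = 2048, 8, 256  # assumed split: 8 heads * 256 = 2048

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_layer(tokens, Wq, Wk, Wv, Wo, W1, W2):
    # Multi-head self-attention over the 49 face-region tokens.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        att = softmax(q[:, s] @ k[:, s].T / np.sqrt(d_head))
        heads.append(att @ v[:, s])
    x = layer_norm(tokens + np.concatenate(heads, axis=-1) @ Wo)
    # Position-wise feedforward network, then the second normalization layer.
    return layer_norm(x + np.maximum(0, x @ W1) @ W2)

# Sequence embedding: 7x7x2048 feature map -> 49 tokens of size 2048.
feature_map = rng.standard_normal((7, 7, 2048)).astype(np.float32)
tokens = feature_map.reshape(49, 2048)
params = [rng.standard_normal((a, b)).astype(np.float32) * 0.02
          for a, b in [(2048, 2048)] * 4 + [(2048, 512), (512, 2048)]]
out = encoder_layer(tokens, *params)
print(out.shape)  # (49, 2048)
```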
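The control flow of claim 4 — generate, score with the discriminator, and regenerate until the confidence exceeds the threshold, then hand the image to the control unit — can be sketched as below. The generator and discriminator here are stand-in stubs, and the 0.8 threshold is an assumed value; the claim specifies neither the networks' internals nor the threshold.

```python
import random

random.seed(7)
THRESHOLD = 0.8  # assumed confidence threshold; claim 4 does not fix a value

def generator(feature_map):
    # Stub: the real generator maps the 7x7x2048 feature map to a
    # predicted makeup effect image.
    return {"image": "makeup-effect", "score": random.random()}

def discriminator(effect_image):
    # Stub: the real discriminator scores the image against real data.
    return effect_image["score"]

def predict_makeup(feature_map, max_rounds=1000):
    # Regenerate until the discriminator's confidence exceeds the threshold;
    # the accepted image is then returned for display on the mirror.
    for _ in range(max_rounds):
        image = generator(feature_map)
        confidence = discriminator(image)
        if confidence > THRESHOLD:
            return image, confidence
    raise RuntimeError("no confident makeup prediction produced")

image, conf = predict_makeup(feature_map=None)
print(conf > THRESHOLD)  # True
```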
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110576729.3A CN113239844B (en) | 2021-05-26 | 2021-05-26 | Intelligent cosmetic mirror system based on multi-head attention target detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113239844A (en) | 2021-08-10
CN113239844B (en) | 2022-11-01
Family
ID=77138938
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110576729.3A Active CN113239844B (en) | 2021-05-26 | 2021-05-26 | Intelligent cosmetic mirror system based on multi-head attention target detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113239844B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111861945A (en) * | 2020-09-21 | 2020-10-30 | 浙江大学 | Text-guided image restoration method and system |
CN112084841A (en) * | 2020-07-27 | 2020-12-15 | 齐鲁工业大学 | Cross-modal image multi-style subtitle generation method and system |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10381105B1 (en) * | 2017-01-24 | 2019-08-13 | Bao | Personalized beauty system |
US11544530B2 (en) * | 2018-10-29 | 2023-01-03 | Nec Corporation | Self-attentive attributed network embedding |
CN111583097A (en) * | 2019-02-18 | 2020-08-25 | 北京三星通信技术研究有限公司 | Image processing method, image processing device, electronic equipment and computer readable storage medium |
WO2021022521A1 (en) * | 2019-08-07 | 2021-02-11 | 华为技术有限公司 | Method for processing data, and method and device for training neural network model |
CN110807332B (en) * | 2019-10-30 | 2024-02-27 | 腾讯科技(深圳)有限公司 | Training method, semantic processing method, device and storage medium for semantic understanding model |
CN111639596B (en) * | 2020-05-29 | 2023-04-28 | 上海锘科智能科技有限公司 | Glasses-shielding-resistant face recognition method based on attention mechanism and residual error network |
CN112017301A (en) * | 2020-07-24 | 2020-12-01 | 武汉纺织大学 | Style migration model and method for specific relevant area of clothing image |
CN112232156B (en) * | 2020-09-30 | 2022-08-16 | 河海大学 | Remote sensing scene classification method based on multi-head attention generation countermeasure network |
2021-05-26: CN202110576729.3A filed → patent CN113239844B/en (status: Active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11302009B2 (en) | Method of image processing using a neural network | |
WO2021043168A1 (en) | Person re-identification network training method and person re-identification method and apparatus | |
KR20210036244A (en) | System and method for boundary aware semantic segmentation | |
WO2021022521A1 (en) | Method for processing data, and method and device for training neural network model | |
CN112434655B (en) | Gait recognition method based on adaptive confidence map convolution network | |
CN109858392B (en) | Automatic face image identification method before and after makeup | |
EP4099220A1 (en) | Processing apparatus, method and storage medium | |
CN111797683A (en) | Video expression recognition method based on depth residual error attention network | |
CN110991380A (en) | Human body attribute identification method and device, electronic equipment and storage medium | |
US9299011B2 (en) | Signal processing apparatus, signal processing method, output apparatus, output method, and program for learning and restoring signals with sparse coefficients | |
CN110210344B (en) | Video action recognition method and device, electronic equipment and storage medium | |
CN112581370A (en) | Training and reconstruction method of super-resolution reconstruction model of face image | |
CN110599395A (en) | Target image generation method, device, server and storage medium | |
KR101366776B1 (en) | Video object detection apparatus and method thereof | |
KR102357000B1 (en) | Action Recognition Method and Apparatus in Untrimmed Videos Based on Artificial Neural Network | |
Fu et al. | A compromise principle in deep monocular depth estimation | |
CN111523377A (en) | Multi-task human body posture estimation and behavior recognition method | |
CN116246338B (en) | Behavior recognition method based on graph convolution and transducer composite neural network | |
US20140086479A1 (en) | Signal processing apparatus, signal processing method, output apparatus, output method, and program | |
CN114694089A (en) | Novel multi-mode fusion pedestrian re-recognition algorithm | |
CN116453025A (en) | Volleyball match group behavior identification method integrating space-time information in frame-missing environment | |
Du et al. | Adaptive visual interaction based multi-target future state prediction for autonomous driving vehicles | |
CN114240999A (en) | Motion prediction method based on enhanced graph attention and time convolution network | |
CN117854160A (en) | Human face living body detection method and system based on artificial multi-mode and fine-granularity patches | |
CN113239844B (en) | Intelligent cosmetic mirror system based on multi-head attention target detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||