CN112560701B - Face image extraction method and device and computer storage medium - Google Patents

Face image extraction method and device and computer storage medium

Info

Publication number
CN112560701B
Authority
CN
China
Prior art keywords
feature map
image
feature
layer
face
Prior art date
Legal status
Active
Application number
CN202011503381.7A
Other languages
Chinese (zh)
Other versions
CN112560701A (en)
Inventor
杨青川
宁瑶
Current Assignee
Chengdu Xinchao Media Group Co Ltd
Original Assignee
Chengdu Xinchao Media Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Xinchao Media Group Co Ltd
Priority to CN202011503381.7A
Publication of CN112560701A
Application granted
Publication of CN112560701B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention discloses a face image extraction method, a face image extraction device, and a computer-readable storage medium. The method comprises: acquiring an image to be recognized; and inputting the image to be recognized into a face extraction model to obtain a face image corresponding to each face in the image to be recognized, wherein the face extraction model comprises a feature map extraction structural layer, a first feature fusion structural layer, and a second feature fusion structural layer. The invention avoids the feature loss caused by layer-by-layer addition of feature maps in conventional face recognition feature extraction, improves the feature characterization capability of the final feature map, avoids the inability of conventional detection networks to recognize edge portraits in an image, and improves the face recall rate.

Description

Face image extraction method and device and computer storage medium
Technical Field
The invention relates to the technical field of face recognition, in particular to a face image extraction method and device and a computer storage medium.
Background
With the rapid development of face detection technology, research on face detection neural networks has become relatively deep, and the technology is continuously updated and iterated. Currently, a common face detection neural network is the RetinaFace detection network, a one-stage face detection network based on an object detection algorithm.
The RetinaFace detection network works as follows: after the network extracts image features from the image to be recognized, finer face features are further extracted through a Feature Pyramid Network (FPN) and an SSH (Single Stage Headless) network, and a detection head then predicts the face frame and the face feature point coordinates, so that the face image can be extracted from the image to be recognized according to the face feature point coordinates and the face frame.
At present, when the RetinaFace detection network extracts face features through the FPN, the features of three dimensions are usually added directly (i.e., an add operation directly sums the feature maps extracted under three receptive fields), generally layer by layer (i.e., the add operation is applied twice, from the highest dimension down to the lowest). Although this operation is simple, it locks feature fusion into addition, so the fusion loses flexibility, features are easily lost, and edge portraits in an image become hard to recognize, greatly reducing the face recall rate (i.e., the detection rate of faces: the proportion of detected faces to all faces in the image).
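For contrast, a minimal PyTorch-style sketch of this layer-by-layer add fusion (an illustrative reconstruction, not code from the patent; the lateral 1 × 1 convolutions are an assumed detail, commonly used to equalize channel counts before addition):

```python
import torch.nn as nn
import torch.nn.functional as F

class AddFusionFPN(nn.Module):
    """Conventional FPN top-down fusion: feature maps are summed layer by layer."""
    def __init__(self, c1, c2, c3, out_ch=64):
        super().__init__()
        # 1x1 lateral convolutions bring all three scales to one channel count,
        # because element-wise addition requires identical tensor shapes
        self.lat1 = nn.Conv2d(c1, out_ch, kernel_size=1)
        self.lat2 = nn.Conv2d(c2, out_ch, kernel_size=1)
        self.lat3 = nn.Conv2d(c3, out_ch, kernel_size=1)

    def forward(self, f1, f2, f3):
        # f1: lowest dimension (largest map); f3: highest dimension (smallest map)
        p3 = self.lat3(f3)
        # first add: the highest dimension is merged into the middle one
        p2 = self.lat2(f2) + F.interpolate(p3, size=f2.shape[-2:], mode="nearest")
        # second add: the middle result is merged into the lowest dimension;
        # f3 only reaches f1 indirectly, and addition collapses values together,
        # which is the inflexibility and feature loss criticized above
        p1 = self.lat1(f1) + F.interpolate(p2, size=f1.shape[-2:], mode="nearest")
        return p1, p2, p3
```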
Disclosure of Invention
In order to solve the problem of a low face recall rate caused by the layer-by-layer addition of features adopted during feature extraction in existing face recognition, the invention aims to provide a face image extraction method and device and a computer-readable storage medium in which a fusion bridge between the highest-dimensional and lowest-dimensional feature maps is added during feature extraction, replacing the layer-by-layer addition of conventional feature maps, reducing feature fusion loss, improving face recognition in images, and thereby improving the face recall rate.
In a first aspect, the present invention provides a method for extracting a face image, including:
acquiring an image to be recognized, wherein the image to be recognized at least comprises a human face;
inputting the image to be recognized into a face extraction model to obtain a face image corresponding to each face in the image to be recognized, wherein the face extraction model comprises a feature map extraction structural layer, a first feature fusion structural layer and a second feature fusion structural layer;
the feature map extraction structural layer is used for extracting feature information of the image to be recognized under three receptive field conditions to obtain a first feature map, a second feature map, and a third feature map of the image to be recognized, respectively;
the first feature fusion structural layer is used for performing first feature fusion on the second feature map and the third feature map to obtain a fourth feature map of the image to be recognized;
the second feature fusion structural layer is configured to perform second feature fusion on the first feature map, the third feature map, and the fourth feature map to obtain a fifth feature map corresponding to each face in the image to be recognized, so that the face image can be extracted from the image to be recognized through the fifth feature map.
Based on the above disclosure, during feature fusion the first, third, and fourth feature maps are fused to obtain a fifth feature map for extracting the face image. In essence, adding the third feature map to the fusion of the first and fourth feature maps is equivalent to adding a fusion bridge between the highest-dimensional feature map (the third) and the lowest-dimensional feature map (the first), thereby fusing features across all three dimensions. With this design, the invention avoids the feature loss caused by layer-by-layer addition of feature maps in conventional face recognition feature extraction, improves the feature characterization capability of the final feature map, avoids the failure of conventional detection networks to recognize edge portraits in the image, and improves the face recall rate.
In one possible design, the second feature fusion structural layer includes: the first up-sampling layer, the second up-sampling layer and the first channel fusion layer;
the first upsampling layer is used for performing first upsampling on the third feature map to obtain a sixth feature map of the image to be recognized;
the second upsampling layer is configured to perform second upsampling on the fourth feature map to obtain a seventh feature map of the image to be recognized;
the first channel fusion layer is used for performing channel fusion on the first feature map, the sixth feature map, and the seventh feature map to obtain the fifth feature map.
Based on the above disclosure, a specific network structure of the second feature fusion structural layer is given: the third and fourth feature maps are first upsampled to enlarge them, and channel fusion (i.e., a concat operation) is then performed on the resulting sixth and seventh feature maps together with the first feature map, fusing the features of the three dimensions into the fifth feature map.
With this design, the invention replaces the direct addition of feature maps in a conventional FPN with channel fusion, which increases the number of extracted features, strengthens the feature characterization capability of the resulting feature map, and thereby further improves the face recall rate.
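A minimal PyTorch-style sketch of this second feature fusion structural layer (a hypothetical rendering of the description above, not the patent's code; spatial sizes are matched explicitly so the concat is well defined, and with 64 channels per input map, as in the embodiment below, the fifth feature map ends up with 192 channels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondFeatureFusion(nn.Module):
    """Upsample the third and fourth feature maps, then channel-fuse with the first."""
    def forward(self, f1, f3, f4):
        # first upsampling: enlarge the third (highest-dimensional) feature map
        f6 = F.interpolate(f3, size=f1.shape[-2:], mode="nearest")  # sixth map
        # second upsampling: enlarge the fourth (already fused) feature map
        f7 = F.interpolate(f4, size=f1.shape[-2:], mode="nearest")  # seventh map
        # channel fusion (concat) instead of element-wise addition: channel
        # counts add up, so no feature values are collapsed together
        return torch.cat([f1, f6, f7], dim=1)                       # fifth map
```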
In one possible design, the face extraction model further includes: a first convolution layer, configured to perform first convolution processing on the fifth feature map so as to reduce the number of channels of the fifth feature map and obtain an eighth feature map of the image to be recognized.
Based on the above disclosure, after the fifth feature map is obtained, a first convolution process is further performed on it to reduce its number of channels and obtain the eighth feature map. With this design, richer feature information can be extracted without increasing the amount of computation, improving the recognition efficiency of the model.
In one possible design, the face extraction model further includes: and the second convolution layer is used for performing second convolution processing on the eighth feature map to obtain a ninth feature map of the image to be recognized, so that the face image is extracted from the image to be recognized through the ninth feature map.
Based on the above disclosure, the second convolution layer performs second convolution processing on the eighth feature map and re-extracts feature information, yielding a ninth feature map that contains finer features and thus provides fine-grained feature information for the subsequent extraction of the face image.
In one possible design, the first convolution process uses a pointwise convolution operation, and uses a convolution kernel of 1 × 1 with a step size of 1.
Based on the above disclosure, the type of convolution operation and the kernel parameters used in the first convolution processing are given: a pointwise convolution with a 1 × 1 kernel and a step size of 1 compresses the number of channels of the fifth feature map.
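As a one-line sketch (the 192-to-64 channel counts follow the embodiment below; the variable name is illustrative):

```python
import torch.nn as nn

# pointwise convolution: a 1x1 kernel with stride 1 recombines channels per
# pixel without changing width or height, e.g. compressing 192 channels to 64
first_convolution = nn.Conv2d(192, 64, kernel_size=1, stride=1)
```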
In one possible design, the face extraction model further includes: a nonlinear conversion structure layer;
the nonlinear conversion structural layer is configured to perform nonlinear conversion on the first feature map, the second feature map, and the third feature map to obtain a tenth feature map, an eleventh feature map, and a twelfth feature map of the image to be recognized, so that the eleventh feature map and the twelfth feature map are input into the first feature fusion structural layer to perform first feature fusion to obtain the fourth feature map, and the tenth feature map is input into the second feature fusion structural layer to perform second feature fusion with the fourth feature map and the third feature map to obtain the fifth feature map.
Based on the above disclosure, a nonlinear conversion structural layer performs nonlinear conversion on the first, second, and third feature maps. In essence, this improves the classification capability of the face extraction model so that it can learn more features, strengthening the model's nonlinear expression capability and the recognition of feature information in the three feature maps. The resulting tenth, eleventh, and twelfth feature maps contain richer feature information and therefore provide more accurate inputs for the subsequent feature fusion and feature extraction.
In one possible design, the nonlinear conversion structure layer includes: a fifth convolutional layer, a sixth convolutional layer, a seventh convolutional layer, a first nonlinear conversion layer, a second nonlinear conversion layer, and a third nonlinear conversion layer;
the fifth convolution layer is used for performing fifth convolution processing on the first feature map to obtain a thirteenth feature map of the image to be recognized;
the sixth convolution layer is configured to perform sixth convolution processing on the second feature map to obtain a fourteenth feature map of the image to be recognized;
the seventh convolution layer is configured to perform seventh convolution processing on the third feature map to obtain a fifteenth feature map of the image to be recognized;
the first nonlinear conversion layer is used for performing nonlinear conversion on the thirteenth feature map by using a PReLU activation function to obtain the tenth feature map;
the second nonlinear conversion layer is configured to perform nonlinear conversion on the fourteenth feature map by using a PReLU activation function to obtain an eleventh feature map;
the third nonlinear conversion layer is configured to perform nonlinear conversion on the fifteenth feature map by using a PReLU activation function to obtain the twelfth feature map.
Based on the above disclosure, the network composition of the nonlinear conversion structural layer is given: three convolution layers perform convolution operations on the feature maps of the three dimensions (the first, second, and third feature maps), extracting their feature information to obtain feature maps with more accurate feature information; a PReLU activation function then applies a nonlinear conversion to each result, so that the model can learn more features. This strengthens the nonlinear expression capability of the feature maps, facilitates the recognition of different types of feature information, and avoids the failure of conventional detection networks to recognize edge portraits.
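A PyTorch-style sketch of one branch of this structure (hypothetical; the common 64-channel output is inferred from the 192-channel concat count mentioned in the embodiment, not stated explicitly in the patent):

```python
import torch.nn as nn

class ConvPReLUBranch(nn.Module):
    """One of the three parallel convolution + PReLU branches."""
    def __init__(self, in_ch, out_ch=64):
        super().__init__()
        # the embodiment uses 1x1 kernels with step size 1 for the
        # fifth/sixth/seventh convolution layers
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1)
        self.act = nn.PReLU(out_ch)  # learned negative slope per channel

    def forward(self, x):
        return self.act(self.conv(x))

# one branch per scale; input channel counts follow the embodiment (64/128/256)
branches = nn.ModuleList(ConvPReLUBranch(c) for c in (64, 128, 256))
```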
In one possible design, the first feature fusion structural layer includes: a third up-sampling layer, a second channel fusion layer, a third convolution layer and a fourth convolution layer;
the third upsampling layer is used for performing third upsampling on the twelfth feature map to obtain a sixteenth feature map of the image to be recognized;
the second channel fusion layer is used for performing channel fusion on the sixteenth feature map and the eleventh feature map to obtain a seventeenth feature map of the image to be recognized;
the third convolution layer is used for performing third convolution processing on the seventeenth feature map so as to reduce the number of channels of the seventeenth feature map and obtain an eighteenth feature map of the image to be recognized;
and the fourth convolution layer is used for performing fourth convolution processing on the eighteenth feature map to obtain the fourth feature map.
Based on the above disclosure, a specific network structure of the first feature fusion structural layer is given: the twelfth feature map is first upsampled and enlarged to obtain a sixteenth feature map; channel fusion (a concat operation) of the sixteenth and eleventh feature maps then yields a seventeenth feature map, increasing the number of features, reducing feature fusion loss, and strengthening the feature characterization capability of the result; a third convolution then compresses the number of channels of the seventeenth feature map (yielding an eighteenth feature map) to reduce the amount of computation; finally, a fourth convolution on the eighteenth feature map produces the fourth feature map.
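A corresponding sketch of the first feature fusion structural layer (hypothetical; the 64-channel width and the padding choice are assumptions, while the 1 × 1 and 3 × 3 kernels follow the embodiment):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstFeatureFusion(nn.Module):
    """Upsample the twelfth map, concat with the eleventh, compress, re-extract."""
    def __init__(self, ch=64):
        super().__init__()
        # third convolution: 1x1 pointwise, compresses the concatenated channels
        self.compress = nn.Conv2d(2 * ch, ch, kernel_size=1, stride=1)
        # fourth convolution: 3x3 with stride 1, re-extracts features afterwards
        self.refine = nn.Conv2d(ch, ch, kernel_size=3, stride=1, padding=1)

    def forward(self, f11, f12):
        # third upsampling: enlarge the twelfth map to the eleventh map's size
        f16 = F.interpolate(f12, size=f11.shape[-2:], mode="nearest")  # sixteenth
        f17 = torch.cat([f16, f11], dim=1)                             # seventeenth
        f18 = self.compress(f17)                                       # eighteenth
        return self.refine(f18)                                        # fourth map
```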
In a second aspect, the present invention provides a face image extraction device, including: the device comprises an acquisition unit and a face image extraction unit;
the acquisition unit is used for acquiring an image to be recognized, wherein the image to be recognized at least comprises a human face;
the face image extraction unit is used for inputting the image to be recognized into a face extraction model to obtain a face image corresponding to each face in the image to be recognized, wherein the face extraction model comprises a feature map extraction structural layer, a first feature fusion structural layer and a second feature fusion structural layer.
In one possible design, the face image extraction unit includes: a first channel fusion subunit;
the first channel fusion subunit is configured to perform first upsampling on the third feature map by using a first upsampling layer to obtain a sixth feature map of the image to be recognized;
the first channel fusion subunit is configured to perform second upsampling on the fourth feature map by using a second upsampling layer to obtain a seventh feature map of the image to be recognized;
the first channel fusion subunit is further configured to perform channel fusion on the first feature map, the sixth feature map, and the seventh feature map by using a first channel fusion layer to obtain the fifth feature map.
In one possible design, the apparatus further includes: a first convolution unit;
the first convolution unit is configured to perform first convolution processing on the fifth feature map by using the first convolution layer to reduce the number of channels of the fifth feature map, so as to obtain an eighth feature map of the image to be recognized.
In one possible design, the apparatus further includes: a second convolution unit;
the second convolution unit is configured to perform second convolution processing on an eighth feature map by using a second convolution layer to obtain a ninth feature map of the image to be recognized, so as to extract the face image from the image to be recognized through the ninth feature map.
In one possible design, the face image extraction unit further includes: a nonlinear conversion subunit;
the nonlinear conversion subunit is configured to perform nonlinear conversion on the first feature map, the second feature map, and the third feature map by using a nonlinear conversion structural layer, to obtain a tenth feature map, an eleventh feature map, and a twelfth feature map of the image to be recognized, respectively, so as to input the eleventh feature map and the twelfth feature map into the first feature fusion structural layer for first feature fusion to obtain the fourth feature map, input the tenth feature map into the second feature fusion structural layer, and perform second feature fusion with the fourth feature map and the third feature map to obtain the fifth feature map.
In one possible design:
the nonlinear conversion subunit specifically uses a fifth convolution layer to perform fifth convolution processing on the first feature map to obtain a thirteenth feature map of the image to be recognized;
the nonlinear conversion subunit specifically uses a sixth convolution layer to perform sixth convolution processing on the second feature map to obtain a fourteenth feature map of the image to be recognized;
the nonlinear conversion subunit specifically uses a seventh convolution layer to perform seventh convolution processing on the third feature map to obtain a fifteenth feature map of the image to be recognized;
the nonlinear conversion subunit specifically uses the first nonlinear conversion layer and uses a PReLU activation function to perform nonlinear conversion on the thirteenth feature map, so as to obtain the tenth feature map;
the nonlinear conversion subunit specifically uses a second nonlinear conversion layer and performs nonlinear conversion on the fourteenth feature map by using a PReLU activation function to obtain the eleventh feature map;
the nonlinear conversion subunit specifically uses a third nonlinear conversion layer and performs nonlinear conversion on the fifteenth feature map by using a PReLU activation function to obtain the twelfth feature map.
In one possible design, the face image extraction unit further includes: a second channel fusion subunit;
the second channel fusion subunit is configured to perform third upsampling on the twelfth feature map by using a third upsampling layer to obtain a sixteenth feature map of the image to be recognized;
the second channel fusion subunit is configured to perform channel fusion on the sixteenth feature map and the eleventh feature map by using a second channel fusion layer to obtain a seventeenth feature map of the image to be recognized;
the second channel fusion subunit is configured to perform third convolution processing on the seventeenth feature map by using a third convolution layer to reduce its number of channels, so as to obtain an eighteenth feature map of the image to be recognized;
and the second channel fusion subunit is configured to perform fourth convolution processing on the eighteenth feature map by using a fourth convolution layer to obtain the fourth feature map.
In a third aspect, the present invention provides a second facial image extraction apparatus, including a memory, a processor and a transceiver, which are sequentially connected in communication, where the memory is used to store a computer program, the transceiver is used to transmit and receive messages, and the processor is used to read the computer program and execute the facial image extraction method as described in the first aspect or any one of the possible designs in the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium, which stores instructions that, when executed on a computer, perform the face image extraction method according to the first aspect or any one of the possible designs of the first aspect.
In a fifth aspect, the present invention provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the face image extraction method as described in the first aspect or any one of the possible designs of the first aspect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic structural diagram of an improved FPN network in the face extraction model provided in the present invention.
Fig. 2 is a schematic flow chart of the face image extraction method provided by the invention.
Fig. 3 is a schematic diagram of a network structure of the face extraction model provided by the invention.
Fig. 4 is a schematic structural diagram of a first face image extraction device provided in the present invention.
Fig. 5 is a schematic structural diagram of a second face image extraction device provided by the invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments. It should be noted that the following examples are provided to aid understanding of the present invention, but are not intended to limit the present invention. Specific structural and functional details disclosed herein are merely illustrative of example embodiments of the invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to the embodiments set forth herein.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments of the present invention.
It should be understood that the term "and/or", as it may appear herein, merely describes an association between objects and means that three relationships may exist; e.g., A and/or B may mean: A exists alone, B exists alone, or both A and B exist. The term "/and", as it may appear herein, describes another association and means that two relationships may exist; e.g., A /and B may mean: A exists alone, or both A and B exist. In addition, the character "/", as it may appear herein, generally means that the associated objects before and after it are in an "or" relationship.
It will be understood that when an element is referred to herein as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Conversely, if an element is referred to herein as being "directly connected" or "directly coupled" to another element, no intervening elements are present. In addition, other words describing relationships between elements (e.g., "between" versus "directly between", "adjacent" versus "directly adjacent", etc.) should be interpreted in a similar manner.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
It should be understood that specific details are provided in the following description to facilitate a thorough understanding of example embodiments. However, it will be understood by those of ordinary skill in the art that the example embodiments may be practiced without these specific details. For example, systems may be shown in block diagrams in order not to obscure the examples in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring the example embodiments.
Examples
Referring to fig. 1, a face extraction model is provided for this embodiment; it modifies the FPN structure of a conventional RetinaFace detection network. The model includes a feature map extraction structural layer, a first feature fusion structural layer, and a second feature fusion structural layer. The feature map extraction structural layer extracts feature information of the image to be recognized under three receptive field conditions, yielding a first, second, and third feature map of the image to be recognized (i.e., feature information is extracted under three different dimensions, giving one feature map per dimension). The first feature fusion structural layer performs channel fusion on the second and third feature maps to obtain a fourth feature map of the image to be recognized. The second feature fusion structural layer performs feature fusion on the fourth, third, and first feature maps to obtain a fifth feature map corresponding to each face in the image to be recognized, from which the face image is extracted. In essence, this adds a fusion bridge between the highest-dimensional feature map (the third) and the lowest-dimensional feature map (the first), fusing features across all three dimensions; this avoids feature loss, increases the number of features, strengthens the recognition of edge portraits in the image, and thereby raises the face recall rate.
In the following description, the technical solution provided by the embodiment of the present application is applied to the model architecture shown in fig. 1 as an example.
As shown in fig. 2, the method for extracting a face image provided in the first aspect of this embodiment is applicable to face recognition in any place (for example, an elevator, a bus, a mall, a movie theater, etc.), and the method may include, but is not limited to, the following steps S101 to S102.
S101, obtaining an image to be recognized, wherein the image to be recognized at least comprises a human face.
Step S101 acquires the image to be recognized, which is subsequently input into the model to obtain face images; face recognition can then be performed on these face images to obtain person information.
In this embodiment, the image to be recognized may be acquired by, but is not limited to: capturing a surveillance video of the region to be monitored and splitting the video frame by frame into a number of images; or having a worker directly upload an image containing a face as the image to be recognized.
And S102, inputting the image to be recognized into a face extraction model to obtain a face image corresponding to each face in the image to be recognized.
Step S102 extracts face images from the image to be recognized using the face extraction model, so that face recognition can subsequently be performed with the extracted face images.
As described above, the model of this embodiment includes a feature map extraction structural layer, a first feature fusion structural layer, and a second feature fusion structural layer; after the image to be recognized is input into the model, feature map extraction and feature fusion are performed in sequence. In this embodiment, the third feature map is added to the fusion of the first and fourth feature maps, bridging the highest-dimensional and lowest-dimensional feature maps and thereby fusing features across all three dimensions. This avoids feature loss, increases the number of features, strengthens the recognition of edge portraits in the image, and improves the face recall rate.
The following describes the face extraction model in this embodiment in detail:
First, the feature map extraction structural layer preprocesses the image to be recognized; in essence, it extracts the feature information of the image to be recognized to obtain its feature maps.
In this embodiment, the feature map extraction structural layer performs the first extraction of feature information under three receptive field conditions. The receptive field is the size of the region of the input image (i.e., the image to be recognized) that maps to one pixel of the feature map; extracting under a receptive field means extracting the feature information of a certain region of the input image. The larger the receptive field, the larger the range of the original image it can reach, and the more global, higher-semantic-level the features it contains.
Extracting feature information under three receptive field conditions is equivalent to extracting it under three dimensions, so feature information of different levels is obtained and the comprehensiveness of the extracted feature information is ensured.
As shown in fig. 1, in this embodiment the feature map extraction structural layer first processes the image to be recognized, reducing it by 8 times, 32 times, and 64 times respectively and changing its number of channels to 64, 128, and 256. That is, the three receptive fields in this embodiment correspond to the image to be recognized reduced by 8 times, by 32 times, and by 64 times. In fig. 1, "64 × w/8 × h/8" means 64 channels with the image width and height each reduced by 8 times.
After the images to be recognized under the three receptive fields are obtained, feature information can be extracted to obtain the first, second, and third feature maps respectively. In this embodiment, the feature information may be extracted by, but is not limited to, a convolution operation, i.e., using a convolution kernel (for example, a 3 × 3 kernel with a step size of 1, or a 5 × 5 kernel with a step size of 2).
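A shape walkthrough under these numbers (a stand-in sketch only: the real model extracts these maps with a MobileNet backbone as described later, and the 640 × 640 input size is assumed for illustration):

```python
import torch
import torch.nn as nn

def stage(cin, cout, stride):
    # plain conv + ReLU as a stand-in for the actual backbone blocks
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.ReLU())

# the strides multiply out to the 8x / 32x / 64x reductions of the embodiment
to_8 = nn.Sequential(stage(3, 32, 2), stage(32, 48, 2), stage(48, 64, 2))
to_32 = nn.Sequential(stage(64, 96, 2), stage(96, 128, 2))
to_64 = stage(128, 256, 2)

x = torch.randn(1, 3, 640, 640)  # image to be recognized
f1 = to_8(x)    # first feature map:  64 x 80 x 80   (64 x w/8 x h/8)
f2 = to_32(f1)  # second feature map: 128 x 20 x 20  (128 x w/32 x h/32)
f3 = to_64(f2)  # third feature map:  256 x 10 x 10  (256 x w/64 x h/64)
```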
After the first feature map, the second feature map and the third feature map are obtained, feature fusion can be performed on the second feature map and the third feature map by using the first feature fusion structural layer to obtain a fourth feature map; and the second feature fusion structural layer performs feature fusion on the first feature map, the third feature map and the fourth feature map to obtain a fifth feature map.
As shown in fig. 1, the following details the network structure of the second feature fusion structure layer:
in this embodiment, the second feature fusion structure layer may include, but is not limited to: the first up-sampling layer, the second up-sampling layer and the first channel fusion layer.
The first up-sampling layer is used for performing first up-sampling on the third feature map to obtain a sixth feature map of the image to be identified; the second upsampling layer is used for performing second upsampling on the fourth feature map to obtain a seventh feature map of the image to be identified; and the first channel fusion layer is used for carrying out channel fusion on the first characteristic diagram, the sixth characteristic diagram and the seventh characteristic diagram to obtain the fifth characteristic diagram.
As explained above, in this embodiment the fourth and third feature maps are upsampled (enlarging the images) before feature fusion with the first feature map, and channel fusion is then performed by the first channel fusion layer.
In this embodiment, the first and second upsampling may use, but are not limited to: nearest neighbor interpolation, bilinear interpolation, mean interpolation, or median interpolation. The magnification may be, but is not limited to: the third feature map enlarged by a factor of 4, and the fourth feature map by a factor of 2.
In addition, in this embodiment channel fusion is adopted to fuse the features of the first, third, and fourth feature maps. Channel fusion, also known as concat fusion, is a common feature fusion mode in neural network models and merges the channel counts of convolution layers; in essence, it integrates the feature information of the first, third, and fourth feature maps to obtain the fifth feature map of the image to be recognized.
With this design, on the one hand, the method bridges the highest-dimensional and lowest-dimensional feature maps, fusing features across all three dimensions and thereby avoiding feature loss. On the other hand, it replaces the direct addition of feature maps in a conventional FPN with channel fusion, which increases the number of features and yields more comprehensive feature information, so the final fifth feature map contains richer feature information; this strengthens the recognition of edge portraits in the image and further improves the face recall rate.
In addition, in this embodiment a first convolution layer is provided to perform first convolution processing on the fifth feature map, compressing its number of channels and obtaining an eighth feature map with fewer channels than the fifth; this ensures that richer feature information is extracted without increasing the amount of computation, improving the recognition speed of the model.
In this embodiment, the first convolution layer uses a pointwise convolution with a 1 × 1 kernel and a step size of 1, i.e., a 1 × 1 convolution. A 1 × 1 convolution does not need to consider the relationship between a pixel and its neighbors; it is mainly used to adjust the number of channels, linearly combining the pixel values across channels before any nonlinear operation. It can therefore reduce the dimension of a feature map (i.e., reduce the number of channels without changing the image width and height), which reduces the amount of computation.
In this embodiment, after the first channel fusion layer fuses the first, third, and fourth feature maps, the number of channels of the resulting fifth feature map may be, but is not limited to, 192; the number of channels after the first convolution processing may be, but is not limited to, preset by the user (e.g., compressed to 64 channels).
After the 1 × 1 convolution, a second convolution layer re-extracts feature information from its output (i.e., from the eighth feature map) to obtain a ninth feature map, through which the face image is extracted from the image to be recognized. In this embodiment, the kernel used in the second convolution processing is 3 × 3 with a step size of 1.
In this embodiment, both the fifth and ninth feature maps could be used to extract face images from the image to be recognized; they differ in their channel counts, and therefore in the amount of computation required for extraction. The fifth feature map has not had its channels compressed, so extraction from it is computationally heavy and slow, whereas the ninth feature map benefits from the 1 × 1 channel compression, which greatly reduces the computation and makes extraction faster.
As shown in fig. 1, to further enrich the feature information in the feature maps, in this embodiment the first, second, and third feature maps undergo nonlinear conversion before entering the first and second feature fusion structural layers. This strengthens the nonlinear expression capability of the feature maps and the model's ability to classify feature information, so that more features are learned; the comprehensiveness of the acquired feature information is thus ensured, comprehensive information is provided for the subsequent recognition of edge portraits, and the face recall rate is further improved.
As shown in fig. 1, in this embodiment, a nonlinear conversion structural layer is arranged to implement nonlinear conversion on the first feature map, the second feature map, and the third feature map, so as to obtain a tenth feature map, an eleventh feature map, and a twelfth feature map of the image to be recognized, respectively.
That is, in this embodiment, the first feature fusion structural layer is equivalent to feature fusion of the twelfth feature map and the eleventh feature map, and the second feature fusion structural layer is equivalent to feature fusion of the tenth feature map, the fourth feature map, and the twelfth feature map.
As shown in fig. 1, example nonlinear conversion structure layers may include, but are not limited to: a fifth convolutional layer, a sixth convolutional layer, a seventh convolutional layer, a first nonlinear conversion layer, a second nonlinear conversion layer, and a third nonlinear conversion layer.
As shown in fig. 1, the fifth convolution layer is configured to perform a fifth convolution process on the first feature map to obtain a thirteenth feature map of the image to be identified; the sixth convolution layer is configured to perform sixth convolution processing on the second feature map to obtain a fourteenth feature map of the image to be identified; and the seventh convolution layer is used for performing seventh convolution processing on the third feature map to obtain a fifteenth feature map of the image to be identified.
The first nonlinear conversion layer is used for carrying out nonlinear conversion on the thirteenth feature map by using a PReLU activation function to obtain a tenth feature map; the second nonlinear conversion layer is configured to perform nonlinear conversion on the fourteenth feature map by using a PReLU activation function to obtain an eleventh feature map; the third nonlinear conversion layer is configured to perform nonlinear conversion on the fifteenth feature map by using a PReLU activation function to obtain the twelfth feature map.
That is, in this embodiment the three convolution layers of the nonlinear conversion structural layer first perform convolution processing on the feature maps extracted under the three dimensions (the first, second, and third feature maps), re-extracting feature information to obtain feature maps with finer feature information (the thirteenth, fourteenth, and fifteenth), which provide the inputs for the subsequent channel fusion.
In this embodiment, the convolution kernels used for convolution processing of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer are all 1 × 1 in size, and the step size is all 1.
Then the thirteenth, fourteenth, and fifteenth feature maps undergo nonlinear conversion in the first, second, and third nonlinear conversion layers respectively, improving the model's ability to classify feature information so that the feature information in the thirteenth, fourteenth, and fifteenth feature maps can be extracted more easily, and strengthening the nonlinear expression capability of the subsequently obtained feature maps; the tenth, eleventh, and twelfth feature maps are thus obtained.
In this embodiment, the three nonlinear conversion layers perform the nonlinear conversion with a PReLU activation function. PReLU is a non-saturating activation function that gives the model stronger classification capability and addresses linear inseparability: when feature information is extracted, more kinds of feature information can be distinguished, raising the recognition rate. Introducing this activation function therefore assists the extraction of feature information, yields more comprehensive feature information, and avoids feature loss.
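For reference, PReLU computes f(x) = x for x > 0 and f(x) = a · x for x ≤ 0, where the slope a is learned rather than fixed at zero as in a plain ReLU; this nonzero negative-side slope is what keeps the function non-saturating and gradients flowing.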
In this embodiment, the feature maps obtained after the nonlinear conversion serve as the inputs for feature fusion, entering the first and second feature fusion structural layers respectively.
As shown in fig. 1, in this embodiment, the first feature fusion structure layer may include, but is not limited to: a third upsampling layer, a second channel fusing layer, a third convolutional layer, and a fourth convolutional layer.
The twelfth feature map enters a third up-sampling layer, and third up-sampling processing is carried out to obtain a sixteenth feature map; in this embodiment, the third upsampling is based on the same principle as the upsampling performed by the first upsampling layer and the second upsampling layer, and is used for realizing the amplification of the feature map so as to perform subsequent feature fusion with the eleventh feature map.
In this embodiment, the third upsampling may, but is not limited to, enlarge the twelfth feature map by a factor of 2 to obtain the sixteenth feature map, which then undergoes channel fusion with the eleventh feature map in the second channel fusion layer to obtain the seventeenth feature map.
In other words, in this embodiment the feature fusion of the second and third feature maps likewise replaces the direct addition of a conventional FPN with channel fusion. With this design, the fused features of the second and third feature maps are increased, raising the number of features and yielding more comprehensive feature information, so that the final fifth feature map contains richer feature information, which benefits the extraction of the face image.
Similarly, in this embodiment a third convolution layer is provided after the channel fusion of the second and third feature maps to compress the number of channels of the seventeenth feature map, so that richer feature information is extracted without increasing the amount of computation.
In this embodiment, the convolution performed by the third convolution layer follows the same principle as that of the first convolution layer: a pointwise convolution with a 1 × 1 kernel and a step size of 1, i.e., channel compression is again achieved by a 1 × 1 convolution.
In this embodiment, since the seventeenth feature map derives from features extracted from the image to be recognized reduced by 16 times, its 128 channels are compressed; for example, but not limited to, the number of channels of the seventeenth feature map may be compressed from 128 to 64 to obtain the eighteenth feature map.
Finally, the fourth convolution layer re-extracts features from the output of the 1 × 1 convolution (i.e., the eighteenth feature map) to obtain the fourth feature map, which then enters the second feature fusion structural layer for channel fusion with the first feature map (i.e., the tenth, after nonlinear conversion) and the third feature map (i.e., the twelfth, after nonlinear conversion and upsampling).
In this embodiment, the kernel used in the fourth convolution processing is 3 × 3 with a step size of 1.
In addition, in this embodiment a fourth nonlinear conversion layer may be provided after the ninth feature map is obtained, likewise applying a PReLU activation function to the ninth feature map. This improves the classification of its feature information, makes that information easier to extract, and strengthens the nonlinear expression capability of the converted feature map, yielding more comprehensive feature information and benefiting the extraction of edge portraits from the image to be recognized.
In summary, given the foregoing description of the network structural layers in the face extraction model, the invention performs feature extraction as follows:
First, nonlinear conversion is applied to the initial feature maps of the image to be recognized under the three dimensions (the first, second, and third feature maps). In essence, this improves the classification capability of the face extraction model so that it learns more features, strengthening its nonlinear expression capability and facilitating the recognition of edge feature information in the three feature maps; the resulting tenth, eleventh, and twelfth feature maps contain richer feature information and provide the image basis for the subsequent, more accurate extraction of face features.
Second, channel fusion is performed on the converted feature maps: the eleventh feature map and the upsampled twelfth feature map are channel-fused to obtain the fourth feature map; then the tenth feature map (the nonlinearly converted first feature map), the sixth feature map (the twelfth feature map, i.e., the nonlinearly converted third feature map, after upsampling), and the fourth feature map are channel-fused, yielding the fifth feature map.
The essence of these operations is: (1) channel fusion replaces the layer-by-layer addition of feature maps in a conventional FPN, increasing the amount of feature information, reducing feature loss, and strengthening the feature characterization capability of the final face feature map, which avoids the failure of conventional detection networks to recognize edge portraits and improves the face recall rate; (2) adding the third feature map to the channel fusion of the fourth and first feature maps amounts to fusing the features of all three dimensions directly, avoiding the feature loss of layer-by-layer addition and further strengthening the feature characterization capability, which further improves the face recall rate.
Finally, to reduce the amount of computation and further strengthen the nonlinear expression capability, the fifth feature map can undergo a 1 × 1 convolution and a nonlinear conversion; this reduces computation while improving the comprehensiveness of the feature information, benefiting the recognition and extraction of faces in the image and further improving the face recall rate.
As shown in fig. 3, the following describes the complete network structure of the face extraction model in this embodiment in detail:
in this embodiment, the twelfth feature map (i.e., the third feature map after processing by the nonlinear conversion structural layer), together with the ninth and fourth feature maps, may be used as the feature maps extracted for the face image and participate in the subsequent SSH network, so as to extract the face image.
As shown in fig. 3, in this embodiment the face extraction model includes: a MobileNet network (whose main compression strategy is depthwise separable convolution), the improved FPN described above, an SSH network, and a face detection network.
In this embodiment, the MobileNet network performs depthwise separable convolution operations on the image to be recognized to obtain the first, second, and third feature maps, which are input into the improved FPN to obtain the fourth, ninth, and twelfth feature maps respectively. The fourth, ninth, and twelfth feature maps are then input into the SSH network for feature re-extraction, yielding finer feature maps. Finally, the feature maps output by the SSH network are input into the face detection network to detect faces and obtain the face frames and face feature point coordinates, so that each face is extracted from the image to be recognized according to its face frame and feature point coordinates, giving the face images.
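Putting the pieces together, a hypothetical sketch of this data flow (the module interfaces are assumptions layered over the sketches above, and the mapping of feature maps to SSH branches is likewise assumed; fig. 3 is authoritative):

```python
import torch.nn as nn

class FaceExtractionModel(nn.Module):
    """Fig. 3 data flow: MobileNet -> improved FPN -> SSH -> detection heads."""
    def __init__(self, backbone, fpn, sshs, heads):
        super().__init__()
        self.backbone = backbone           # depthwise-separable conv backbone
        self.fpn = fpn                     # improved FPN of fig. 1
        self.sshs = nn.ModuleList(sshs)    # SSH1/SSH2/SSH3: small/medium/large faces
        self.heads = nn.ModuleList(heads)  # one face detection network per branch

    def forward(self, image):
        f1, f2, f3 = self.backbone(image)   # first/second/third feature maps
        p4, p9, p12 = self.fpn(f1, f2, f3)  # fourth/ninth/twelfth feature maps
        results = []
        for feat, ssh, head in zip((p9, p4, p12), self.sshs, self.heads):
            refined = ssh(feat)             # finer features at this scale
            results.append(head(refined))   # class scores, box offsets, landmarks
        return results                      # used to crop the face images
```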
In this embodiment, the SSH Network is an improvement based on a Visual Geometry Group Network (VGG-16 Network), and three parallel networks (i.e., the SSH1 Network, the SSH2 Network, and the SSH3 Network in fig. 3) are constructed above the VGG, and the three parallel networks are respectively used for detecting a small face, a medium face, and a large face according to a feature map to obtain a finer feature map; in this embodiment, the feature maps output by the SSH network are three feature map lists (i.e., the first, second, and third feature maps in fig. 3), and the subsequent face detection network performs face detection on the images in the three feature map lists.
Each feature map here is a stack of multiple two-dimensional feature maps, which amounts to describing the same input from multiple angles: within the SSH network, the feature map is convolved with several different convolution kernels to obtain different pieces of feature information, and these together form feature descriptions of the image from different angles at the same layer.
The face detection network performs detection on the prior frames output by the SSH network, and covers three aspects: target detection (face classification), face box adjustment (face box regression), and face feature point coordinate detection (facial landmark regression). The three detections are performed on the three feature maps respectively (in fig. 3, face detection network 1 corresponds to the first feature map, face detection network 2 to the second feature map, and face detection network 3 to the third feature map).
Target detection (face classification): used to detect whether a face is present in the prior frame, that is, whether the prior frame contains a target. A 1x1 convolution adjusts the number of channels of the SSH network output to 2, representing the probability that the prior frame contains a face. Note that this does not use a single probability value; instead, two values represent the presence of a face in the prior frame: if the first value is larger, a face is present, and if the second value is larger, no face is present.
Face box adjustment (face box regression): the center coordinates and the width and height of the prior frame are adjusted by four parameters in the SSH network.
Face feature point coordinate detection (facial landmark regression): the prior frame is adjusted to obtain the face key points (i.e., the feature point coordinates). Each face key point requires two adjustment parameters, and there are five key points in total, so a 1x1 convolution adjusts the number of channels of the SSH network output to 5x2, representing the adjustment of each key point of each prior frame: 5 for the five key points on the face, and 2 for the two adjustment parameters of each key point.
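The three detection heads described above can be sketched as follows; this is an illustrative reading of the text, not the patented implementation. The input channel count is assumed, and the heads are written for a single prior frame per feature map position (real detectors typically multiply these channel counts by the number of prior frames):

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Illustrative per-prior-frame prediction heads, with channel counts
    taken from the text: 2 for face/no-face classification, 4 for the
    face box adjustment, and 5x2 for the five facial landmarks."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, 2, kernel_size=1)            # face vs. no face
        self.box = nn.Conv2d(in_ch, 4, kernel_size=1)            # center x/y, width, height
        self.landmark = nn.Conv2d(in_ch, 5 * 2, kernel_size=1)   # 5 key points x 2 offsets

    def forward(self, feat: torch.Tensor):
        return self.cls(feat), self.box(feat), self.landmark(feat)

feat = torch.randn(1, 64, 20, 20)  # an assumed feature map from the SSH stage
cls_out, box_out, lmk_out = DetectionHeads(64)(feat)
print(cls_out.shape, box_out.shape, lmk_out.shape)
# torch.Size([1, 2, 20, 20]) torch.Size([1, 4, 20, 20]) torch.Size([1, 10, 20, 20])
```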
With this design, the faces in the image to be recognized can be extracted to obtain face images.
Next, the face extraction model provided by this embodiment is compared with a traditional Retinaface detection network. The two models use the same data set and the same batch_size (the number of samples selected per training step), and are trained for the same number of epochs (one epoch is one complete pass over all the data in the training set). The mAP indexes (performance indexes) of the face recall rate under the three difficulty levels, easy, medium, and hard, are compared; the comparison results are shown in table 1:
TABLE 1
(Table 1 is reproduced as an image in the original publication; it reports the face recall mAP of the two models under the easy, medium, and hard difficulty levels.)
As can be seen from table 1, under all three difficulty levels the face recall rate of the face extraction model provided by the present invention is higher than that of the traditional Retinaface detection network, and the face recognition effect is markedly better.
As shown in fig. 4, a second aspect of this embodiment provides a hardware apparatus for implementing the face image extraction method of the first aspect of the embodiment, including: an acquisition unit and a face image extraction unit.
The acquisition unit is used for acquiring an image to be recognized, wherein the image to be recognized at least comprises a human face.
The face image extraction unit is used for inputting the image to be recognized into a face extraction model to obtain a face image corresponding to each face in the image to be recognized, wherein the face extraction model comprises a feature map extraction structural layer, a first feature fusion structural layer and a second feature fusion structural layer.
In one possible design, the face image extraction unit includes: a first channel fusion subunit.
The first channel fusion subunit is configured to perform first upsampling on the third feature map by using a first upsampling layer, so as to obtain a sixth feature map of the image to be identified.
The first channel fusion subunit is further configured to perform second upsampling on the fourth feature map by using a second upsampling layer, so as to obtain a seventh feature map of the image to be identified.
The first channel fusion subunit is further configured to perform channel fusion on the first feature map, the sixth feature map, and the seventh feature map by using a first channel fusion layer, so as to obtain the fifth feature map.
In one possible design, the apparatus further includes: a first convolution element.
The first convolution unit is configured to perform first convolution processing on the fifth feature map by using the first convolution layer to reduce the number of channels of the fifth feature map, so as to obtain an eighth feature map of the image to be identified.
In one possible design, the apparatus further includes: a second convolution unit.
The second convolution unit is configured to perform second convolution processing on the eighth feature map by using a second convolution layer to obtain a ninth feature map of the image to be recognized, so as to extract the face image from the image to be recognized through the ninth feature map.
In one possible design, the face image extraction unit further includes: a non-linear conversion subunit.
The nonlinear conversion subunit is configured to perform nonlinear conversion on the first feature map, the second feature map, and the third feature map by using a nonlinear conversion structural layer, obtaining a tenth, an eleventh, and a twelfth feature map of the image to be recognized, respectively. The eleventh and twelfth feature maps are then input into the first feature fusion structural layer for first feature fusion to obtain the fourth feature map, and the tenth feature map is input into the second feature fusion structural layer for second feature fusion with the fourth feature map and the third feature map to obtain the fifth feature map.
In one possible design:
and the nonlinear conversion subunit specifically uses a fifth convolution layer to perform fifth convolution processing on the first feature map to obtain a thirteenth feature map of the image to be identified.
And the nonlinear conversion subunit specifically uses a sixth convolution layer to perform sixth convolution processing on the second feature map to obtain a fourteenth feature map of the image to be identified.
And the nonlinear conversion subunit specifically uses a seventh convolution layer to perform seventh convolution processing on the third feature map to obtain a fifteenth feature map of the image to be identified.
The nonlinear conversion subunit specifically uses a first nonlinear conversion layer to perform nonlinear conversion on the thirteenth feature map by using a PReLU activation function, so as to obtain the tenth feature map.
The nonlinear conversion subunit specifically uses a second nonlinear conversion layer and uses a PReLU activation function to perform nonlinear conversion on the fourteenth feature map, so as to obtain the eleventh feature map.
The nonlinear conversion subunit specifically uses a third nonlinear conversion layer and performs nonlinear conversion on the fifteenth feature map by using a PReLU activation function, so as to obtain the twelfth feature map.
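For reference, the PReLU activation used by the three nonlinear conversion layers computes PReLU(x) = x for x > 0 and a*x otherwise, where the slope a is a learnable parameter. A minimal demonstration follows; the initial slope of 0.25 is PyTorch's default, not a value fixed by the patent:

```python
import torch
import torch.nn as nn

# PReLU(x) = x for x > 0, otherwise a * x, with a learnable slope a.
x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
prelu = nn.PReLU(init=0.25)  # initial slope a = 0.25 (PyTorch default)
print(prelu(x))  # tensor([-0.5000, -0.1250, 0.0000, 1.0000, 3.0000], grad_fn=...)
```

Unlike a plain ReLU, PReLU keeps a small, trainable response for negative inputs, which preserves some feature information that a hard zero cutoff would discard.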
In one possible design, the face image extraction unit further includes: and a second channel fusion subunit.
The second channel fusion subunit is configured to perform third upsampling on the twelfth feature map by using a third upsampling layer, so as to obtain a sixteenth feature map of the image to be identified.
The second channel fusion subunit is further configured to perform channel fusion on the sixteenth feature map and the eleventh feature map by using a second channel fusion layer, so as to obtain a seventeenth feature map of the image to be identified.
The second channel fusion subunit is further configured to perform third convolution processing on the seventeenth feature map by using a third convolution layer to reduce the number of channels of the seventeenth feature map, so as to obtain an eighteenth feature map of the image to be identified.
The second channel fusion subunit is further configured to perform fourth convolution processing on the eighteenth feature map by using a fourth convolution layer to obtain the fourth feature map.
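Pulling the four steps above together, a sketch of this fusion path could look like the following; all channel counts, spatial sizes, kernel sizes, and the nearest-neighbor upsampling mode are illustrative assumptions rather than values specified by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shapes: (batch, channels, height, width).
eleventh = torch.randn(1, 64, 40, 40)  # derived from the second feature map
twelfth = torch.randn(1, 64, 20, 20)   # derived from the third feature map

# Third upsampling: bring the twelfth feature map to the eleventh's size.
sixteenth = F.interpolate(twelfth, size=eleventh.shape[-2:], mode="nearest")

# Channel fusion of the sixteenth and eleventh feature maps.
seventeenth = torch.cat([sixteenth, eleventh], dim=1)  # 128 channels

# Third convolution: reduce the channel count of the seventeenth map.
third_conv = nn.Conv2d(128, 64, kernel_size=1)
eighteenth = third_conv(seventeenth)

# Fourth convolution: produce the fourth feature map (3x3 kernel assumed).
fourth_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
fourth = fourth_conv(eighteenth)
print(fourth.shape)  # torch.Size([1, 64, 40, 40])
```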
For the working process, the working details, and the technical effects of the hardware apparatus provided in this embodiment, reference may be made to the first aspect of the embodiment, which is not described herein again.
As shown in fig. 5, a third aspect of this embodiment provides a second hardware device for implementing the facial image extraction method in the first aspect of the embodiment, where the hardware device includes a memory, a processor, and a transceiver, which are sequentially connected in a communication manner, where the memory is used to store a computer program, the transceiver is used to send and receive messages, and the processor is used to read the computer program and execute the facial image extraction method in the first aspect of the embodiment.
For example, the memory may include, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a flash memory, a first-in-first-out memory (FIFO), and/or a first-in-last-out memory (FILO); the processor may be, but is not limited to, a microprocessor of the STM32F105 series, a reduced instruction set computer (RISC) microprocessor, an architecture processor such as X86, or a processor integrated with an embedded neural network processing unit (NPU); the transceiver may be, but is not limited to, a wireless fidelity (WiFi) wireless transceiver, a Bluetooth wireless transceiver, a general packet radio service (GPRS) wireless transceiver, a ZigBee wireless transceiver (a low-power local area network protocol based on the IEEE 802.15.4 standard), a 3G transceiver, a 4G transceiver, and/or a 5G transceiver. In addition, the device may also include, but is not limited to, a power module, a display screen, and other necessary components.
For the working process, the working details, and the technical effects of the hardware apparatus provided in this embodiment, reference may be made to the first aspect of the embodiment, which is not described herein again.
A fourth aspect of the present embodiment provides a computer-readable storage medium storing instructions for implementing the face image extraction method according to the first aspect; that is, when the stored instructions are run on a computer, the face image extraction method of the first aspect is performed. The computer-readable storage medium is a carrier for storing data and may include, but is not limited to, floppy disks, optical disks, hard disks, flash memories, flash disks, and/or memory sticks; the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
For the working process, the working details, and the technical effects of the computer-readable storage medium provided in this embodiment, reference may be made to the first aspect of the embodiment, which is not described herein again.
A fifth aspect of the present embodiment provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the face image extraction method according to the first aspect of the embodiment; the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
The embodiments described above are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device to perform the methods described in the embodiments or some portions of the embodiments.
The invention is not limited to the above alternative embodiments; any product in any other form may be derived by anyone in the light of the present invention, but any change in its shape or structure that falls within the scope of the claims of the present invention remains within the protection scope of the present invention.

Claims (9)

1. A face image extraction method is characterized by comprising the following steps:
acquiring an image to be recognized, wherein the image to be recognized at least comprises a human face;
inputting the image to be recognized into a face extraction model to obtain a face image corresponding to each face in the image to be recognized, wherein the face extraction model comprises a feature map extraction structural layer, a first feature fusion structural layer and a second feature fusion structural layer;
the characteristic map extraction structural layer is used for extracting characteristic information of the image to be identified under three receptive field conditions to respectively obtain a first characteristic map, a second characteristic map and a third characteristic map of the image to be identified;
the first feature fusion structural layer is used for performing first feature fusion on the second feature map and the third feature map to obtain a fourth feature map of the image to be identified;
the second feature fusion structural layer is used for performing second feature fusion on the first feature map, the third feature map and the fourth feature map to obtain a fifth feature map corresponding to each face in the image to be recognized, so that the face image is extracted from the image to be recognized through the fifth feature map;
the face extraction model further comprises: converting the structural layer in a nonlinear way;
the nonlinear conversion structural layer is configured to perform nonlinear conversion on the first feature map, the second feature map, and the third feature map to obtain a tenth feature map, an eleventh feature map, and a twelfth feature map of the image to be recognized, so that the eleventh feature map and the twelfth feature map are input into the first feature fusion structural layer to perform first feature fusion, so as to obtain the fourth feature map, and the tenth feature map is input into the second feature fusion structural layer to perform second feature fusion with the fourth feature map and the third feature map, so as to obtain the fifth feature map.
2. The method of claim 1, wherein the second feature fusion structure layer comprises: the first up-sampling layer, the second up-sampling layer and the first channel fusion layer;
the first up-sampling layer is used for performing first up-sampling on the third feature map to obtain a sixth feature map of the image to be identified;
the second upsampling layer is used for performing second upsampling on the fourth feature map to obtain a seventh feature map of the image to be identified;
the first channel fusion layer is configured to perform channel fusion on the first feature map, the sixth feature map, and the seventh feature map to obtain the fifth feature map.
3. The method of claim 2, wherein the face extraction model further comprises: and the first convolution layer is used for performing first convolution processing on the fifth feature map so as to reduce the number of channels of the fifth feature map and obtain an eighth feature map of the image to be identified.
4. The method of claim 3, wherein the face extraction model further comprises: and the second convolution layer is used for performing second convolution processing on the eighth feature map to obtain a ninth feature map of the image to be recognized, so that the face image is extracted from the image to be recognized through the ninth feature map.
5. The method of claim 3, wherein the first convolution process uses a pointwise convolution operation with a convolution kernel of 1x1 and a step size of 1.
6. The method of claim 1, wherein the nonlinear conversion structure layer comprises: a fifth convolutional layer, a sixth convolutional layer, a seventh convolutional layer, a first nonlinear conversion layer, a second nonlinear conversion layer, and a third nonlinear conversion layer;
the fifth convolution layer is used for performing fifth convolution processing on the first feature map to obtain a thirteenth feature map of the image to be identified;
the sixth convolution layer is configured to perform sixth convolution processing on the second feature map to obtain a fourteenth feature map of the image to be identified;
the seventh convolution layer is configured to perform seventh convolution processing on the third feature map to obtain a fifteenth feature map of the image to be identified;
the first nonlinear conversion layer is used for carrying out nonlinear conversion on the thirteenth feature map by using a PReLU activation function to obtain a tenth feature map;
the second nonlinear conversion layer is configured to perform nonlinear conversion on the fourteenth feature map by using a PReLU activation function to obtain an eleventh feature map;
the third nonlinear conversion layer is configured to perform nonlinear conversion on the fifteenth feature map by using a PReLU activation function, so as to obtain the twelfth feature map.
7. A face image extraction device, characterized by comprising: the device comprises an acquisition unit and a face image extraction unit;
the acquisition unit is used for acquiring an image to be recognized, wherein the image to be recognized at least comprises a human face;
the face image extraction unit is used for inputting the image to be recognized into a face extraction model to obtain a face image corresponding to each face in the image to be recognized, wherein the face extraction model comprises a feature map extraction structural layer, a first feature fusion structural layer and a second feature fusion structural layer;
the characteristic map extraction structure layer is used for extracting characteristic information of the image to be identified under three receptive field conditions to respectively obtain a first characteristic map, a second characteristic map and a third characteristic map of the image to be identified;
the first feature fusion structural layer is used for performing first feature fusion on the second feature map and the third feature map to obtain a fourth feature map of the image to be identified;
the second feature fusion structural layer is used for performing second feature fusion on the first feature map, the third feature map and the fourth feature map to obtain a fifth feature map corresponding to each face in the image to be recognized, so that the face image is extracted from the image to be recognized through the fifth feature map;
the face image extraction unit further includes: a nonlinear conversion subunit;
the nonlinear conversion subunit is configured to perform nonlinear conversion on the first feature map, the second feature map, and the third feature map by using a nonlinear conversion structural layer, to obtain a tenth feature map, an eleventh feature map, and a twelfth feature map of the image to be recognized, respectively, so as to input the eleventh feature map and the twelfth feature map into the first feature fusion structural layer for first feature fusion to obtain the fourth feature map, input the tenth feature map into the second feature fusion structural layer, and perform second feature fusion with the fourth feature map and the third feature map to obtain the fifth feature map.
8. A face image extraction device, characterized by comprising: a memory, a processor, and a transceiver that are sequentially connected in a communication manner, wherein the memory is used to store a computer program, the transceiver is used to send and receive messages, and the processor is used to read the computer program and execute the face image extraction method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that: the computer-readable storage medium has stored thereon instructions that, when executed on a computer, perform the face image extraction method according to any one of claims 1 to 6.
CN202011503381.7A 2020-12-17 2020-12-17 Face image extraction method and device and computer storage medium Active CN112560701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011503381.7A CN112560701B (en) 2020-12-17 2020-12-17 Face image extraction method and device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011503381.7A CN112560701B (en) 2020-12-17 2020-12-17 Face image extraction method and device and computer storage medium

Publications (2)

Publication Number Publication Date
CN112560701A CN112560701A (en) 2021-03-26
CN112560701B true CN112560701B (en) 2022-10-25

Family

ID=75063654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011503381.7A Active CN112560701B (en) 2020-12-17 2020-12-17 Face image extraction method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN112560701B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801066B (en) * 2021-04-12 2022-05-17 北京圣点云信息技术有限公司 Identity recognition method and device based on multi-posture facial veins
CN114644276B (en) * 2022-04-11 2022-12-02 伊萨电梯有限公司 Intelligent elevator control method under mixed scene condition

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871101A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN108229531A (en) * 2017-09-29 2018-06-29 北京市商汤科技开发有限公司 Characteristics of objects processing method, device, storage medium and electronic equipment
CN110414371A (en) * 2019-07-08 2019-11-05 西南科技大学 A kind of real-time face expression recognition method based on multiple dimensioned nuclear convolution neural network
CN111144310A (en) * 2019-12-27 2020-05-12 创新奇智(青岛)科技有限公司 Face detection method and system based on multi-layer information fusion
CN111461983A (en) * 2020-03-31 2020-07-28 华中科技大学鄂州工业技术研究院 Image super-resolution reconstruction model and method based on different frequency information
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN113869282A (en) * 2021-10-22 2021-12-31 马上消费金融股份有限公司 Face recognition method, hyper-resolution model training method and related equipment

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101388075B (en) * 2008-10-11 2011-11-16 大连大学 Human face identification method based on independent characteristic fusion
KR102486699B1 (en) * 2014-12-15 2023-01-11 삼성전자주식회사 Method and apparatus for recognizing and verifying image, and method and apparatus for learning image recognizing and verifying
CN108985181B (en) * 2018-06-22 2020-07-24 华中科技大学 End-to-end face labeling method based on detection segmentation
CN109344779A (en) * 2018-10-11 2019-02-15 高新兴科技集团股份有限公司 A kind of method for detecting human face under ring road scene based on convolutional neural networks
CN111860077A (en) * 2019-04-30 2020-10-30 北京眼神智能科技有限公司 Face detection method, face detection device, computer-readable storage medium and equipment
CN110309836B (en) * 2019-07-01 2021-05-18 北京地平线机器人技术研发有限公司 Image feature extraction method, device, storage medium and equipment
CN110598788B (en) * 2019-09-12 2023-06-30 腾讯科技(深圳)有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN111091521B (en) * 2019-12-05 2023-04-07 腾讯科技(深圳)有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN111401290A (en) * 2020-03-24 2020-07-10 杭州博雅鸿图视频技术有限公司 Face detection method and system and computer readable storage medium
CN111695430B (en) * 2020-05-18 2023-06-30 电子科技大学 Multi-scale face detection method based on feature fusion and visual receptive field network
CN112036288B (en) * 2020-08-27 2022-03-15 华中师范大学 Facial expression recognition method based on cross-connection multi-feature fusion convolutional neural network

Also Published As

Publication number Publication date
CN112560701A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN110929569B (en) Face recognition method, device, equipment and storage medium
CN110717851A (en) Image processing method and device, neural network training method and storage medium
CN112348783A (en) Image-based person identification method and device and computer-readable storage medium
CN112560701B (en) Face image extraction method and device and computer storage medium
CN113780132B (en) Lane line detection method based on convolutional neural network
CN110706239A (en) Scene segmentation method fusing full convolution neural network and improved ASPP module
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
RU2697928C1 (en) Superresolution of an image imitating high detail based on an optical system, performed on a mobile device having limited resources, and a mobile device which implements
CN107704797B (en) Real-time detection method, system and equipment based on pedestrians and vehicles in security video
CN102930518A (en) Improved sparse representation based image super-resolution method
CN110717921A (en) Full convolution neural network semantic segmentation method of improved coding and decoding structure
CN111832453A (en) Unmanned scene real-time semantic segmentation method based on double-path deep neural network
CN112541459A (en) Crowd counting method and system based on multi-scale perception attention network
CN113139551A (en) Improved semantic segmentation method based on deep Labv3+
CN110570402B (en) Binocular salient object detection method based on boundary perception neural network
CN112017116A (en) Image super-resolution reconstruction network based on asymmetric convolution and construction method thereof
CN113436210B (en) Road image segmentation method fusing context progressive sampling
CN113361493B (en) Facial expression recognition method robust to different image resolutions
CN113743300A (en) Semantic segmentation based high-resolution remote sensing image cloud detection method and device
CN111488839B (en) Target detection method and target detection system
CN112232292A (en) Face detection method and device applied to mobile terminal
CN116524432A (en) Application of small target detection algorithm in traffic monitoring
CN116434039A (en) Target detection method based on multiscale split attention mechanism
CN114612456B (en) Billet automatic semantic segmentation recognition method based on deep learning
CN116469172A (en) Bone behavior recognition video frame extraction method and system under multiple time scales

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant