CN113095310B - Face position detection method, electronic device and storage medium - Google Patents

Face position detection method, electronic device and storage medium

Info

Publication number
CN113095310B
Authority
CN
China
Prior art keywords
matrix
vector
face
full
face position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110646494.0A
Other languages
Chinese (zh)
Other versions
CN113095310A (en)
Inventor
叶小培
王月平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Moredian Technology Co ltd
Original Assignee
Hangzhou Moredian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Moredian Technology Co ltd
Priority to CN202110646494.0A
Publication of CN113095310A
Application granted
Publication of CN113095310B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Abstract

The embodiments of the present application relate to a face position detection method, an electronic device and a storage medium, belonging to the technical field of face recognition. The method includes the following steps: extracting first feature vectors from a collected face picture using a fully connected layer, and splicing the first feature vectors into a feature vector matrix; segmenting the feature vector matrix according to a preset segmentation strategy and feeding it into a specific cross-domain fully connected layer to obtain a target vector matrix; inputting the target vector matrix row by row into a global average pooling layer, and extracting a second feature vector through a fully connected layer; feeding the second feature vector into a confidence fully connected layer and a detection-position fully connected layer respectively to obtain a confidence vector and a face position vector; and calculating the face position according to the face position vector and the confidence vector. According to the embodiments of the present application, the detected face position is highly accurate, the amount of computation is reduced, the running speed is high, and the method can run in real time on devices with weak computing power.

Description

Face position detection method, electronic device and storage medium
Technical Field
The present application relates to the field of face recognition technologies, and in particular, to a face position detection method, an electronic device, and a storage medium.
Background
With the increasing maturity of AI-related technologies, face detection is being applied more and more widely in many fields, such as access control, attendance, finance and identity verification. An effective face detection technique is therefore becoming increasingly important. Face position detection refers to identifying the relative position of a face in a face picture acquired from an imaging device and filtering out non-face data; the input face data mainly take the form of photos, videos and the like.
At present, the face position is mainly detected with attention-mechanism neural networks and convolutional neural networks. However, both kinds of networks (especially large convolutional neural networks) require a large amount of computation at run time and therefore cannot run in real time on devices with weak computing power (such as access control devices). No effective solution to this problem has yet been proposed in the related art.
Disclosure of Invention
The embodiments of the present application provide a face position detection method, an electronic device and a storage medium, so as to at least solve the problem in the related art that the large amount of computation required by neural networks prevents real-time operation on devices with weak computing power.
In a first aspect, an embodiment of the present application provides a face position detection method, including: extracting first feature vectors from a collected face picture using a fully connected layer, and splicing the first feature vectors into a feature vector matrix; segmenting the feature vector matrix according to a preset segmentation strategy and feeding it into a specific cross-domain fully connected layer to obtain a target vector matrix; inputting the target vector matrix row by row into a global average pooling layer, and extracting a second feature vector through a fully connected layer; feeding the second feature vector into a confidence fully connected layer and a detection-position fully connected layer respectively to obtain a confidence vector and a face position vector; and calculating the face position according to the face position vector and the confidence vector.
In some of these embodiments, the specific cross-domain fully connected layer includes a cross-channel-domain fully connected layer and a cross-spatial-domain fully connected layer.
In some embodiments, extracting the first feature vectors from the collected face picture using a fully connected layer and splicing them into a feature vector matrix includes: normalizing the collected face picture to a picture of a preset size; cutting the picture of the preset size into a preset number of equally sized block images; flattening each block image into a one-dimensional vector and inputting it into a fully connected layer to obtain the preset number of first feature vectors of a specified length; and splicing the preset number of first feature vectors vertically into the feature vector matrix.
In some embodiments, when the feature vector matrix is denoted as a first matrix [n, t], where n denotes the number of features and t denotes the feature length, segmenting the feature vector matrix according to the preset segmentation strategy and feeding it into the specific cross-domain fully connected layer to obtain the target vector matrix includes:
Step 1: transposing the first matrix [n, t] to obtain a first transposed matrix [t, n]; splitting the first transposed matrix [t, n] into t vectors of length n, feeding them into the cross-channel-domain fully connected layer, and splicing the outputs vertically to obtain a second transposed matrix [t, n]; transposing the second transposed matrix [t, n] to obtain a second matrix [n, t]; adding the second matrix [n, t] and the first matrix [n, t] to obtain a third matrix [n, t];
Step 2: splitting the third matrix [n, t] into n vectors of length t, feeding them into the cross-spatial-domain fully connected layer, feeding the outputs into the activation function ReLU, and splicing the results vertically to obtain a fourth matrix [n, t]; adding the fourth matrix [n, t] and the third matrix [n, t] to obtain a fifth matrix [n, t];
Step 3: taking the fifth matrix [n, t] as the first matrix [n, t] and repeating step 1 and step 2; when the preset number of cycles is reached, the target vector matrix is obtained (an illustrative sketch of this procedure follows).
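Steps 1 to 3 can be illustrated with a small NumPy sketch. Everything below is an assumption for illustration only: the sizes n and t, the number of cycles, the bias terms and the randomly initialized weight matrices are hypothetical and are not taken from the patent.

import numpy as np

rng = np.random.default_rng(0)

n, t = 64, 128          # assumed sizes: number of features n and feature length t
num_cycles = 4          # the description gives four cycles as an example

# Hypothetical, randomly initialized fully connected layers (biases included for generality).
W_channel = rng.standard_normal((n, n)) * 0.02   # cross-channel-domain FC: length-n vectors -> length-n vectors
b_channel = np.zeros(n)
W_spatial = rng.standard_normal((t, t)) * 0.02   # cross-spatial-domain FC: length-t vectors -> length-t vectors
b_spatial = np.zeros(t)

def relu(x):
    return np.maximum(x, 0.0)

x = rng.standard_normal((n, t))                  # first matrix [n, t], i.e. the feature vector matrix

for _ in range(num_cycles):
    # Step 1: transpose to [t, n], pass each of the t length-n rows through the
    # cross-channel-domain FC layer, splice vertically, transpose back, add residual.
    second_transposed = x.T @ W_channel.T + b_channel   # second transposed matrix [t, n]
    third = second_transposed.T + x                     # second matrix [n, t] + first matrix [n, t]

    # Step 2: pass each of the n length-t rows of the third matrix through the
    # cross-spatial-domain FC layer, apply ReLU, splice vertically, add residual.
    fourth = relu(third @ W_spatial.T + b_spatial)      # fourth matrix [n, t]
    x = fourth + third                                  # fifth matrix, fed back as the first matrix

target_vector_matrix = x                                # obtained after the preset number of cycles
print(target_vector_matrix.shape)                       # (64, 128)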
In some embodiments, inputting the target vector matrix row by row into the global average pooling layer and extracting the second feature vector through the fully connected layer includes: computing the average of each row of the target vector matrix using the global average pooling layer to obtain a one-dimensional vector of fixed length, feeding the one-dimensional vector into a fully connected layer, and outputting the second feature vector.
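A minimal sketch of this pooling-and-projection step, again with hypothetical sizes (n = 64, t = 128, output length 256) and random weights:

import numpy as np

rng = np.random.default_rng(1)
n, t, d = 64, 128, 256                      # assumed sizes; d is the length of the second feature vector

target_vector_matrix = rng.standard_normal((n, t))

# Global average pooling row by row: the mean of each row gives a fixed-length one-dimensional vector.
v = target_vector_matrix.mean(axis=1)       # shape (n,)

# Hypothetical fully connected layer producing the second feature vector v'.
W = rng.standard_normal((d, n)) * 0.02
b = np.zeros(d)
v_prime = W @ v + b
print(v_prime.shape)                        # (256,)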
In some embodiments, where the length of the face position vector is n × 9 × 4 and the length of the confidence vector is n × 9 × 2, calculating the face position according to the face position vector and the confidence vector includes: reshaping the face position vector into a three-dimensional matrix [n, 9, 4] and reshaping the confidence vector into a three-dimensional matrix [n, 9, 2]; decoding the third dimension of the three-dimensional matrix [n, 9, 4] to compute the face coordinate frames; applying a softmax transformation to the third dimension of the three-dimensional matrix [n, 9, 2] to compute the confidences; and calculating the face position according to the face coordinate frames and the confidences.
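The reshaping and softmax operations are fully specified above, but the decoding scheme for the face coordinate frames is not given in this text; the sketch below therefore treats decoding as an explicitly hypothetical placeholder and assumes, for illustration only, that the second softmax channel is the face confidence.

import numpy as np

rng = np.random.default_rng(2)
n = 64                                           # assumed number of features

v1 = rng.standard_normal(n * 9 * 4)              # face position vector, length n x 9 x 4
v2 = rng.standard_normal(n * 9 * 2)              # confidence vector, length n x 9 x 2

m1 = v1.reshape(n, 9, 4)                         # three-dimensional matrix [n, 9, 4]
m2 = v2.reshape(n, 9, 2)                         # three-dimensional matrix [n, 9, 2]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Softmax over the third dimension of m2; taking the second channel as the face
# confidence is an assumption made here for illustration.
confidence = softmax(m2, axis=2)[..., 1]         # shape (n, 9)

def decode_boxes(raw):
    # Placeholder: the text decodes the third dimension of m1 into face coordinate
    # frames, but the exact decoding scheme (e.g. offsets relative to predefined
    # anchor boxes) is not given, so the raw values are returned unchanged here.
    return raw

boxes = decode_boxes(m1)                         # shape (n, 9, 4): one coordinate frame per prediction
print(confidence.shape, boxes.shape)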
In some embodiments, calculating the face position according to the face position vector and the confidence vector includes: computing the confidences and the face coordinate frames from the confidence vector and the face position vector; and selecting the face coordinate frames whose confidence is greater than a threshold to obtain the position of the detected face.
In some embodiments, after selecting the face coordinate frames whose confidence is greater than the threshold, the method further includes: merging the selected face coordinate frames using non-maximum suppression (NMS).
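Non-maximum suppression itself is a standard procedure; the following is a generic greedy NMS sketch over boxes in [x1, y1, x2, y2] form with an arbitrarily chosen IoU threshold, not the patent's specific implementation.

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; boxes are rows of [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]               # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the highest-scoring box with each remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou <= iou_threshold]        # drop boxes that overlap the kept box too much
    return keep

# Example: two heavily overlapping detections and one separate detection.
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))                         # [0, 2]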
In a second aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to perform any one of the methods described above.
In a third aspect, an embodiment of the present application provides a storage medium, in which a computer program is stored, where the computer program is configured to execute any one of the above methods when the computer program runs.
Compared with the related-art methods that detect the face position through an attention-mechanism neural network or a convolutional neural network, which require a large amount of computation, cannot run in real time on devices with weak computing power, and give detection results of limited accuracy, the embodiments of the present application mainly use fully connected layers, which reduces the amount of computation and allows real-time operation on devices with weak computing power. In addition, the feature vector matrix is segmented according to a preset segmentation strategy and then fed into the specific cross-domain fully connected layer, so that the feature information of every aspect of every dimension is used efficiently; furthermore, not only a global average pooling layer is used, but also a confidence fully connected layer and a detection-position fully connected layer, and the face position is calculated from the face position vector and the confidence vector, which greatly improves the accuracy of the detection result.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a face position detection method according to an embodiment of the present application;
fig. 2 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. Reference herein to "a plurality" means greater than or equal to two. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
In the related art, the neural networks used to detect face positions are built from convolutional layers or attention mechanisms, and their large amount of computation prevents them from running in real time on devices with weak computing power. The embodiments of the present application therefore provide a face position detection method that detects the face position through a cross-domain fully connected deep learning network model (referred to as the "model" for short). The cross-domain fully connected deep learning network model includes a specific cross-domain fully connected layer, a global average pooling layer, a confidence fully connected layer and a detection-position fully connected layer, and the specific cross-domain fully connected layer includes a cross-channel-domain fully connected layer and a cross-spatial-domain fully connected layer. The embodiments of the present application thus use no convolutional layer or attention layer and mainly use fully connected layers, so the model is small, runs fast, gives highly accurate detection results, and can be adapted to the mainstream neural-network acceleration chips on the market.
As an example, fig. 1 is a flowchart of a face position detection method according to an embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
s101: extracting a first feature vector from the collected face picture by adopting a full connection layer, and splicing the first feature vector into a feature vector matrix;
s102: the characteristic vector matrix is divided through a preset division strategy and sent to a specific cross-domain full-connection layer to obtain a target vector matrix;
s103: inputting the target vector matrix into the global average pooling layer line by line, and extracting a second feature vector through the full-connection layer;
s104: respectively sending the second feature vectors into a confidence coefficient full-link layer and a detection position full-link layer to obtain a confidence coefficient vector and a face position vector;
s105: and calculating the face position according to the face position vector and the confidence coefficient vector.
As can be seen from the above, no convolutional layer or attention layer is needed: the cross-domain fully connected deep learning network model of the embodiments of the present application can detect the face position quickly while reducing the amount of computation, which lowers the requirement on device computing power, saves cost, and allows the model to run on devices with weak computing power (such as access control devices). In addition, the feature vector matrix is segmented according to a preset segmentation strategy and then fed into the specific cross-domain fully connected layer, so that the feature information of every aspect of every dimension is used efficiently; furthermore, not only a global average pooling layer is used, but also a confidence fully connected layer and a detection-position fully connected layer, and the face position is calculated from the face position vector and the confidence vector, which greatly improves the accuracy of the detection result.
In order to make the embodiments of the present application clearer, the following examples are given for illustrative purposes.
Preparation phase of model
In the embodiments of the present application, in order for the cross-domain fully connected deep learning network model to meet the requirements on speed and accuracy, a large number of face pictures are needed as samples to train the model; preferably, the face pictures are color face pictures. The samples include positive samples and negative samples: the positive samples are pictures containing faces taken at different angles, under different lighting and at different distances, and the negative samples are pictures taken at different angles, under different lighting and at different distances that contain no face and/or pictures that closely resemble a face.
After the samples are obtained, data cleaning and data preprocessing are performed; optionally, the sample data set is divided into a training set and a test set at a ratio of 7:3.
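As an illustration of the optional 7:3 split, a minimal sketch (the sample file names, the dataset size and the random seed are hypothetical):

import numpy as np

rng = np.random.default_rng(42)
samples = [f"face_{i:05d}.jpg" for i in range(10000)]    # hypothetical sample identifiers

indices = rng.permutation(len(samples))
split = int(0.7 * len(samples))                          # 7:3 ratio between training and test sets
train_set = [samples[i] for i in indices[:split]]
test_set = [samples[i] for i in indices[split:]]
print(len(train_set), len(test_set))                     # 7000 3000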
As an example, an original color face picture is normalized to a picture of a preset size, for example 512 × 512 × 3 pixels, and then cut into a preset number n of equally sized square block images, so that each block image has a size of (512/√n) × (512/√n) × 3 pixels; each block image is then flattened into a one-dimensional vector m, giving n one-dimensional vectors m.
The n one-dimensional vectors m are input into a fully connected layer and linearly transformed to obtain n one-dimensional feature vectors of specified length t (the first feature vectors); these n one-dimensional feature vectors of length t are spliced vertically into a two-dimensional [n, t] feature matrix, giving the feature vector matrix, which can be denoted as the first matrix [n, t].
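For illustration only, this preparation step can be sketched as follows under assumed values: n = 64 blocks (so each square block is 64 × 64 × 3 pixels), a hypothetical feature length t = 128, and a randomly initialized fully connected layer.

import numpy as np

rng = np.random.default_rng(3)

image = rng.random((512, 512, 3))                 # normalized color face picture, 512 x 512 x 3
n, t = 64, 128                                    # assumed: 64 blocks, feature length 128
side = int(512 // np.sqrt(n))                     # side of each square block: 64 pixels

# Cut the picture into n equally sized square blocks and flatten each into a one-dimensional vector m.
blocks = [image[r:r + side, c:c + side, :].reshape(-1)
          for r in range(0, 512, side)
          for c in range(0, 512, side)]           # n vectors of length side * side * 3

# Hypothetical fully connected layer mapping each flattened block to a length-t first feature vector.
W = rng.standard_normal((t, side * side * 3)) * 0.02
b = np.zeros(t)
first_feature_vectors = [W @ m + b for m in blocks]

# Splice the n first feature vectors vertically into the feature vector matrix [n, t].
feature_vector_matrix = np.stack(first_feature_vectors, axis=0)
print(feature_vector_matrix.shape)                # (64, 128)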
Training phase of model
(1) Transposing the first matrix [n, t] to obtain a first transposed matrix [t, n]; splitting the first transposed matrix [t, n] into t one-dimensional vectors of length n, feeding them into the cross-channel-domain fully connected layer, and splicing the t output vectors of length n vertically to obtain a second transposed matrix [t, n]; transposing the second transposed matrix [t, n] to obtain a second matrix [n, t]; adding the second matrix [n, t] and the first matrix [n, t] element by element to obtain a third matrix [n, t];
(2) splitting the third matrix [n, t] into n one-dimensional vectors of length t, feeding them into the cross-spatial-domain fully connected layer, then feeding the n output vectors of length t into the activation function ReLU, and splicing the activation outputs vertically to obtain a fourth matrix [n, t]; adding the fourth matrix [n, t] and the third matrix [n, t] to obtain a fifth matrix [n, t];
(3) taking the fifth matrix [n, t] as the first matrix [n, t] and repeating steps (1) and (2); when a preset number of cycles (for example, four) is reached, a target vector matrix [n', t'] is obtained;
(4) inputting the target vector matrix [n', t'] into the global average pooling layer row by row and taking the average of each row to obtain a one-dimensional vector v of fixed size;
(5) feeding the one-dimensional vector v into a fully connected layer to obtain a second feature vector v';
(6) feeding the second feature vector v' into the confidence fully connected layer and the detection-position fully connected layer respectively to obtain a face position vector v1 and a confidence vector v2, where the length of the face position vector v1 is n × 9 × 4 and the length of the confidence vector v2 is n × 9 × 2; reshaping the face position vector v1 into a three-dimensional matrix m1 of shape (n, 9, 4) and the confidence vector v2 into a three-dimensional matrix m2 of shape (n, 9, 2); decoding the third dimension of the three-dimensional matrix m1 to compute the face coordinate frames, and applying a softmax transformation to the third dimension of the three-dimensional matrix m2 to compute the face confidences (referred to simply as "confidences"); then selecting the face coordinate frames whose confidence is greater than a threshold and, preferably, merging them using non-maximum suppression (NMS) to obtain the detected face positions;
(7) computing a loss function from the detected face positions; when the loss function converges, the trained model is obtained. The loss function includes a class loss function and a position loss function.
The formula for the class loss function (shown as an image in the original publication) involves an adjustment coefficient, whose final value is, for example, 0.35, a power coefficient γ, whose final value is, for example, 2.16, and the output of the class-j label.
The formula for the position loss function (also shown as an image in the original publication) is a function of x, where x represents the difference between the true and predicted values of the decoded position.
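Given the quantities named above (an adjustment coefficient of about 0.35, a power coefficient γ of about 2.16, the output of the class-j label, and a position error x), the two losses are consistent with a focal-loss-style class term and a smooth-L1-style position term. The LaTeX below is an assumed reconstruction under that reading, with p_j introduced here as a hypothetical symbol for the output of the class-j label; it is not the published formula.

% Assumed focal-loss-style class loss; alpha is the adjustment coefficient, gamma the power
% coefficient, and p_j the (hypothetical symbol for the) output of the class-j label:
L_{\mathrm{cls}} = -\alpha \,(1 - p_j)^{\gamma} \log p_j

% Assumed smooth-L1-style position loss; x is the difference between the true and predicted
% values of the decoded position:
L_{\mathrm{pos}}(x) = \begin{cases} 0.5\,x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}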
It should be noted that the model is trained on the training set while the depth and width of the model and its parameters are continuously adjusted. Provided the accuracy requirement is met, the smaller the computational cost of the model, the better.
Application phase of model
The trained model is deployed on the device side (for example, an access control device) and detects the face position in the pictures sent by the camera in real time, specifically including the following steps:
(1) normalizing the color face picture collected by the camera to a picture of 512 × 512 × 3 pixels and cutting it into n equally sized square block images, each of size (512/√n) × (512/√n) × 3 pixels; then flattening each block image into a one-dimensional vector m, giving n one-dimensional vectors m;
(2) inputting the n one-dimensional vectors m into a fully connected layer and linearly transforming them to obtain n one-dimensional feature vectors of specified length t; splicing the n one-dimensional feature vectors of length t vertically into a two-dimensional [n, t] feature matrix, giving the feature vector matrix, which can be denoted as the first matrix [n, t];
(3) transposing the first matrix [n, t] to obtain a first transposed matrix [t, n]; splitting the first transposed matrix [t, n] into t one-dimensional vectors of length n, feeding them into the cross-channel-domain fully connected layer, and splicing the t output vectors of length n vertically to obtain a second transposed matrix [t, n]; transposing the second transposed matrix [t, n] to obtain a second matrix [n, t]; adding the second matrix [n, t] and the first matrix [n, t] element by element to obtain a third matrix [n, t];
(4) splitting the third matrix [n, t] into n one-dimensional vectors of length t, feeding them into the cross-spatial-domain fully connected layer, then feeding the n output vectors of length t into the activation function ReLU, and splicing the activation outputs vertically to obtain a fourth matrix [n, t]; adding the fourth matrix [n, t] and the third matrix [n, t] to obtain a fifth matrix [n, t];
(5) taking the fifth matrix [n, t] as the first matrix [n, t] and repeating steps (3) and (4); when the number of cycles reaches four, a target vector matrix [n', t'] is obtained;
(6) inputting the target vector matrix [n', t'] into the global average pooling layer row by row and taking the average of each row to obtain a one-dimensional vector v of fixed size, then feeding the one-dimensional vector v into the fully connected layer to obtain a second feature vector v';
(7) feeding the second feature vector v' into the confidence fully connected layer and the detection-position fully connected layer respectively to obtain a face position vector v1 and a confidence vector v2, where the length of the face position vector v1 is n × 9 × 4 and the length of the confidence vector v2 is n × 9 × 2; reshaping the face position vector v1 into a three-dimensional matrix m1 of shape (n, 9, 4) and the confidence vector v2 into a three-dimensional matrix m2 of shape (n, 9, 2); decoding the third dimension of the three-dimensional matrix m1 to compute the face coordinate frames, and applying a softmax transformation to the third dimension of the three-dimensional matrix m2 to compute the face confidences (referred to simply as "confidences");
(8) selecting the face coordinate frames whose confidence is greater than the threshold and, preferably, merging them using non-maximum suppression (NMS) to obtain the detected face positions.
It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.
In addition, in combination with the face position detection method in the foregoing embodiments, an embodiment of the present application provides a storage medium. The storage medium stores a computer program; when executed by a processor, the computer program implements any of the face position detection methods in the above embodiments.
It should also be noted that the computing power of edge devices such as access control devices and gates is far weaker than that of computers and mobile phones, and the mainstream neural-network acceleration chips on the market do not support hardware acceleration of attention mechanisms, convolutional layers and the like. Therefore, compared with models based on convolutional layers and attention mechanisms, the embodiments of the present application mainly use fully connected layers and no convolutional or attention layers at all, so the model is small, fast and accurate and can be adapted to the mainstream neural-network acceleration chips on the market; moreover, the feature vector matrix is segmented in different ways and then fed into the specific cross-domain fully connected layer, achieving cross-channel-domain and cross-spatial-domain data interaction and utilization, so that the feature information of every aspect of every dimension is used efficiently and the accuracy of the detection result can be improved.
An embodiment of the present application also provides an electronic device, which may be a terminal. The electronic device includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the electronic device is used to connect to and communicate with an external terminal through a network. The computer program, when executed by the processor, implements a face position detection method. The display screen of the electronic device may be a liquid crystal display or an electronic-ink display, and the input device of the electronic device may be a touch layer covering the display screen, a key, trackball or touchpad provided on the housing of the electronic device, or an external keyboard, touchpad or mouse.
In one embodiment, fig. 2 is a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 2, an electronic device is provided, which may be a server, and its internal structure may be as shown in fig. 2. The electronic device includes a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, where the non-volatile memory stores an operating system, a computer program and a database. The processor provides computing and control capabilities, the network interface communicates with external terminals through a network connection, the internal memory provides an environment for running the operating system and the computer program, the computer program is executed by the processor to implement the face position detection method, and the database is used to store data.
Those skilled in the art will appreciate that the architecture shown in fig. 2 is a block diagram of only a portion of the architecture associated with the subject application, and does not constitute a limitation on the electronic devices to which the subject application may be applied, and that a particular electronic device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (8)

1. A face position detection method is characterized by comprising the following steps:
normalizing the collected face picture to a picture of a preset size;
cutting the picture of the preset size into a preset number of equally sized block images;
flattening each block image into a one-dimensional vector and inputting the one-dimensional vector into a fully connected layer to obtain the preset number of first feature vectors of a specified length;
splicing the preset number of first feature vectors vertically into a feature vector matrix;
segmenting the feature vector matrix according to a preset segmentation strategy and feeding it into a specific cross-domain fully connected layer to obtain a target vector matrix, the specific cross-domain fully connected layer comprising a cross-channel-domain fully connected layer and a cross-spatial-domain fully connected layer;
inputting the target vector matrix row by row into a global average pooling layer, and extracting a second feature vector through a fully connected layer;
feeding the second feature vector into a confidence fully connected layer and a detection-position fully connected layer respectively to obtain a confidence vector and a face position vector;
and calculating the face position according to the face position vector and the confidence vector.
2. The method according to claim 1, wherein, in a case where the feature vector matrix is denoted as a first matrix [n, t], n denoting the number of features and t denoting the feature length, segmenting the feature vector matrix according to the preset segmentation strategy and feeding it into the specific cross-domain fully connected layer to obtain the target vector matrix comprises:
step 1: transposing the first matrix [n, t] to obtain a first transposed matrix [t, n]; splitting the first transposed matrix [t, n] into t vectors of length n, feeding them into the cross-channel-domain fully connected layer, and splicing the outputs vertically to obtain a second transposed matrix [t, n]; transposing the second transposed matrix [t, n] to obtain a second matrix [n, t]; adding the second matrix [n, t] and the first matrix [n, t] to obtain a third matrix [n, t];
step 2: splitting the third matrix [n, t] into n vectors of length t, feeding them into the cross-spatial-domain fully connected layer, feeding the outputs into the activation function ReLU, and splicing the results vertically to obtain a fourth matrix [n, t]; adding the fourth matrix [n, t] and the third matrix [n, t] to obtain a fifth matrix [n, t];
step 3: taking the fifth matrix [n, t] as the first matrix [n, t] and repeating step 1 and step 2; when the preset number of cycles is reached, the target vector matrix is obtained.
3. The method of claim 1, wherein inputting the target vector matrix row by row into the global average pooling layer and extracting the second feature vector through the fully connected layer comprises:
computing the average of each row of the target vector matrix using the global average pooling layer to obtain a one-dimensional vector of fixed length, feeding the one-dimensional vector into a fully connected layer, and outputting the second feature vector.
4. The method of claim 1, wherein, in a case where the length of the face position vector is n × 9 × 4 and the length of the confidence vector is n × 9 × 2, calculating the face position according to the face position vector and the confidence vector comprises:
reshaping the face position vector into a three-dimensional matrix [n, 9, 4], and reshaping the confidence vector into a three-dimensional matrix [n, 9, 2];
decoding the third dimension of the three-dimensional matrix [n, 9, 4] to compute a face coordinate frame;
applying a softmax transformation to the third dimension of the three-dimensional matrix [n, 9, 2] to compute a confidence;
and calculating the face position according to the face coordinate frame and the confidence.
5. The method of claim 1, wherein calculating the face position according to the face position vector and the confidence vector comprises:
computing a confidence and a face coordinate frame according to the confidence vector and the face position vector;
and selecting the face coordinate frames whose confidence is greater than a threshold to obtain the position of the detected face.
6. The method of claim 5, wherein, after selecting the face coordinate frames whose confidence is greater than the threshold, the method further comprises:
merging the selected face coordinate frames using non-maximum suppression (NMS).
7. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
8. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any one of claims 1 to 6 when executed.
CN202110646494.0A 2021-06-10 2021-06-10 Face position detection method, electronic device and storage medium Active CN113095310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110646494.0A CN113095310B (en) 2021-06-10 2021-06-10 Face position detection method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110646494.0A CN113095310B (en) 2021-06-10 2021-06-10 Face position detection method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113095310A CN113095310A (en) 2021-07-09
CN113095310B (en) 2021-09-07

Family

ID=76662699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110646494.0A Active CN113095310B (en) 2021-06-10 2021-06-10 Face position detection method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113095310B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673465A (en) * 2021-08-27 2021-11-19 中国信息安全测评中心 Image detection method, device, equipment and readable storage medium
CN117503062B (en) * 2023-11-21 2024-04-09 欣颜时代(广州)技术有限公司 Neural detection control method, device, equipment and storage medium of beauty instrument

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150069989A (en) * 2013-12-16 2015-06-24 한국과학기술원 Apparatus for simultaneously detecting and recognizing a face using local block texture feature
CN112257578B (en) * 2020-10-21 2023-07-07 平安科技(深圳)有限公司 Face key point detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113095310A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN111192292B (en) Target tracking method and related equipment based on attention mechanism and twin network
KR102592270B1 (en) Facial landmark detection method and apparatus, computer device, and storage medium
CN110147836B (en) Model training method, device, terminal and storage medium
Sharma Information Measure Computation and its Impact in MI COCO Dataset
WO2019100724A1 (en) Method and device for training multi-label classification model
CN113095310B (en) Face position detection method, electronic device and storage medium
CN111680672B (en) Face living body detection method, system, device, computer equipment and storage medium
CN114155365B (en) Model training method, image processing method and related device
CN112084917A (en) Living body detection method and device
CN111968134B (en) Target segmentation method, device, computer readable storage medium and computer equipment
CN111914908B (en) Image recognition model training method, image recognition method and related equipment
WO2021164280A1 (en) Three-dimensional edge detection method and apparatus, storage medium and computer device
CN106803054B (en) Faceform's matrix training method and device
US20220375211A1 (en) Multi-layer perceptron-based computer vision neural networks
CN113111880B (en) Certificate image correction method, device, electronic equipment and storage medium
EP4246458A1 (en) System for three-dimensional geometric guided student-teacher feature matching (3dg-stfm)
CN114092833A (en) Remote sensing image classification method and device, computer equipment and storage medium
CN113706481A (en) Sperm quality detection method, sperm quality detection device, computer equipment and storage medium
CN112348116A (en) Target detection method and device using spatial context and computer equipment
CN113705276A (en) Model construction method, model construction device, computer apparatus, and medium
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN115713769A (en) Training method and device of text detection model, computer equipment and storage medium
CN113962990B (en) Chest CT image recognition method and device, computer equipment and storage medium
CN116310899A (en) YOLOv 5-based improved target detection method and device and training method
CN116091596A (en) Multi-person 2D human body posture estimation method and device from bottom to top

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant